This lively and engaging textbook explains the things you have to know
in order to read empirical papers in the social and health sciences, as well as 
the techniques you need to build statistical models of your own. The author, 
David A. Freedman, explains the basic ideas of association and regression, 
and takes you through the current models that link these ideas to causality. 

The focus is on applications of linear models, including generalized 
least squares and two-stage least squares, with probits and logits for binary 
variables. The bootstrap is developed as a technique for estimating bias and 
computing standard errors. Careful attention is paid to the principles of sta- 
tistical inference. There is background material on study design, bivariate re- 
gression, and matrix algebra. To develop technique, there are computer labs 
with sample computer programs. The book is rich in exercises, most with 
answers. 

Target audiences include advanced undergraduates and beginning grad- 
uate students in statistics, as well as students and professionals in the social 
and health sciences. The discussion in the book is organized around published 
studies, as are many of the exercises. Relevant journal articles are reprinted 
at the back of the book. Freedman makes a thorough appraisal of the statisti- 
cal methods in these papers and in a variety of other examples. He illustrates 
the principles of modeling, and the pitfalls. The discussion shows you how 
to think about the critical issues—including the connection (or lack of it) 
between the statistical models and the real phenomena. 


Features of the book

• Authoritative guide by a well-known author with wide experience in
  teaching, research, and consulting
• Will be of interest to anyone who deals with applied statistics
• No-nonsense, direct style
• Careful analysis of statistical issues that come up in substantive
  applications, mainly in the social and health sciences
• Can be used as a text in a course or read on its own
• Developed over many years at Berkeley, thoroughly class tested
• Background material on regression and matrix algebra
• Plenty of exercises
• Extra material for instructors, including data sets and MATLAB code for
  lab projects (send email to solutions@cambridge.org)


Foreword to the Revised Edition


Some books are correct. Some are clear. Some are useful. Some are 
entertaining. Few are even two of these. This book is all four. Statistical 
Models: Theory and Practice is lucid, candid and insightful, a joy to read. 
We are fortunate that David Freedman finished this new edition before his 
death in late 2008. We are deeply saddened by his passing, and we greatly 
admire the energy and cheer he brought to this volume—and many other 
projects—during his final months. 

This book focuses on half a dozen of the most common tools in applied
statistics, presenting them crisply, without jargon or hyperbole. It dissects
real applications: a quarter of the book reprints articles from the social and 
life sciences that hinge on statistical models. It articulates the assumptions 
necessary for the tools to behave well and identifies the work that the as- 
sumptions do. This clarity makes it easier for students and practitioners to 
see where the methods will be reliable; where they are likely to fail, and 
how badly; where a different method might work; and where no inference is 
possible—no matter what tool somebody tries to sell them. 

Many texts at this level are little more than bestiaries of methods, pre- 
senting dozens of tools with scant explication or insight, a cookbook, 
numbers-are-numbers approach. “If the left hand side is continuous, use a 
linear model; fit by least-squares. If the left hand side is discrete, use a logit 
or probit model; fit by maximum likelihood.” Presenting statistics this way 
invites students to believe that the resulting parameter estimates, standard 
errors, and tests of significance are meaningful—perhaps even untangling 
complex causal relationships. They teach students to think scientific infer- 
ence is purely algorithmic. Plug in the numbers; out comes science. This 
undervalues both substantive and statistical knowledge. 

To select an appropriate statistical method actually requires careful 
thought about how the data were collected and what they measure. Data 
are not “just numbers.” Using statistical methods in situations where the un- 
derlying assumptions are false can yield gold or dross—but more often dross. 

Statistical Models brings this message home by showing both good and 
questionable applications of statistical tools in landmark research: a study 
of political intolerance during the McCarthy period, the effect of Catholic 
schooling on completion of high school and entry into college, the relation- 
ship between fertility and education, and the role of government institutions 
in shaping social capital. Other examples are drawn from medicine and
epidemiology, including John Snow’s classic work on the cause of cholera,
a shining example of the success of simple statistical tools when paired with 
substantive knowledge and plenty of shoe leather. These real applications 
bring the theory to life and motivate the exercises. 

The text is accessible to upper-division undergraduates and beginning 
graduate students. Advanced graduate students and established researchers 
will also find new insights. Indeed, the three of us have learned much by 
reading it and teaching from it. 

And those who read this textbook have not exhausted Freedman’s ap- 
proachable work on these topics. Many of his related research articles are 
collected in Statistical Models and Causal Inference: A Dialogue with the 
Social Sciences (Cambridge University Press, 2009), a useful companion to 
this text. The collection goes further into some applications mentioned in the 
textbook, such as the etiology of cholera and the health effects of Hormone 
Replacement Therapy. Other applications range from adjusting the census 
for undercount to quantifying earthquake risk. Several articles address the- 
oretical issues raised in the textbook. For instance, randomized assignment 
in an experiment is not enough to justify regression: without further assump- 
tions, multiple regression estimates of treatment effects are biased. The col- 
lection also covers the philosophical foundations of statistics and methods 
the textbook does not, such as survival analysis. 

Statistical Models: Theory and Practice presents serious applications 
and the underlying theory without sacrificing clarity or accessibility. Freed- 
man shows with wit and clarity how statistical analysis can inform and how 
it can deceive. This book is unlike any other, a treasure: an introductory 
book that conveys some of the wisdom required to make reliable statistical 
inferences. It is an important part of Freedman’s legacy. 


David Collier, Jasjeet Singh Sekhon, and Philip B. Stark 
University of California, Berkeley 


Preface 


This book is primarily intended for advanced undergraduates or begin- 
ning graduate students in statistics. It should also be of interest to many 
students and professionals in the social and health sciences. Although writ- 
ten as a textbook, it can be read on its own. The focus is on applications of 
linear models, including generalized least squares, two-stage least squares, 
probits and logits. The bootstrap is explained as a technique for estimating 
bias and computing standard errors. 

The contents of the book can fairly be described as what you have to 
know in order to start reading empirical papers that use statistical models. The 
emphasis throughout is on the connection—or lack of connection—between 
the models and the real phenomena. Much of the discussion is organized 
around published studies; the key papers are reprinted for ease of reference. 
Some observers may find the tone of the discussion too skeptical. If you 
are among them, I would make an unusual request: suspend belief until you 
finish reading the book. (Suspension of disbelief is all too easily obtained, 
but that is a topic for another day.) 

The first chapter contrasts observational studies with experiments, and 
introduces regression as a technique that may help to adjust for confounding 
in observational studies. There is a chapter that explains the regression line, 
and another chapter with a quick review of matrix algebra. (At Berkeley, half 
the statistics majors need these chapters.) The going would be much easier 
with students who know such material. Another big plus would be a solid 
upper-division course introducing the basics of probability and statistics. 

Technique is developed by practice. At Berkeley, we have lab sessions 
where students use the computer to analyze data. There is a baker’s dozen of 
these labs at the back of the book, with outlines for several more, and there 
are sample computer programs. Data are available to instructors from the 
publisher, along with source files for the labs and computer code: send email 
to solutions@cambridge.org.

A textbook is only as good as its exercises, and there are plenty of them 
in the pages that follow. Some are mathematical and some are hypothetical, 
providing the analogs of lemmas and counter-examples in a more conven- 
tional treatment. On the other hand, many of the exercises are based on 
actual studies. Here is a summary of the data and the analysis; here is a
specific issue: where do you come down? Answers to most of the exercises
are at the back of the book. Beyond exercises and labs, students at Berkeley 
write papers during the semester. Instructions for projects are also available 
from the publisher. 

A text is defined in part by what it chooses to discuss, and in part by 
what it chooses to ignore; the topics of interest are not to be covered in one 
book, no matter how thick. My objective was to explain how practitioners 
infer causation from association, with the bootstrap as a counterpoint to the 
usual asymptotics. Examining the logic of the enterprise is crucial, and that 
takes time. If a favorite technique has been slighted, perhaps this reasoning 
will make amends. 

There is enough material in the book for 15-20 weeks of lectures and 
discussion at the undergraduate level, or 10-15 weeks at the graduate level. 
With undergraduates on the semester system, I cover chapters 1-7, and
introduce simultaneity (sections 9.1-4). This usually takes 13 weeks. If things
go quickly, I do the bootstrap (chapter 8), and the examples in chapter 9. 
On a quarter system with ten-week terms, I would skip the student presenta- 
tions and chapters 8-9; the bivariate probit model in chapter 7 could also be 
dispensed with. 

During the last two weeks of a semester, students present their projects, 
or discuss them with me in office hours. I often have a review period on 
the last day of class. For a graduate course, I supplement the material with 
additional case studies and discussion of technique. 

The revised text organizes the chapters somewhat differently, which 
makes the teaching much easier. The exposition has been improved in a 
number of other ways, without (I hope) introducing new difficulties. There 
are many new examples and exercises. 


Acknowledgements 


I’ve taught graduate and undergraduate courses based on this material for 
many years at Berkeley, and on occasion at Stanford and Athens. The students 
in those courses were helpful and supportive. I would also like to thank Dick
[...] Diaconis, Thad Dunning, Mike Finkelstein, Paul Humphreys, Jon McAuliffe,
Doug Rivers, Mike Roberts, Don Ylvisaker, and Peng Zhao, along with several
anonymous reviewers, for many useful comments. Russ Lyons and Roger [...]


Observational Studies and Experiments


1.1 Introduction 


This book is about regression models and variants like path models, 
simultaneous-equation models, logits and probits. Regression models can be 
used for different purposes: 


(i) to summarize data, 
(ii) to predict the future,
(iii) to predict the results of interventions. 


The third—causal inference—is the most interesting and the most slippery. It 
will be our focus. For background, this section covers some basic principles 
of study design. 

Causal inferences are made from observational studies, natural exper- 
iments, and randomized controlled experiments. When using observational 
(non-experimental) data to make causal inferences, the key problem is con- 
founding. Sometimes this problem is handled by subdividing the study pop- 
ulation (stratification, also called cross-tabulation), and sometimes by mod- 
eling. These strategies have various strengths and weaknesses, which need 
to be explored.

In medicine and social science, causal inferences are most solid when
based on randomized controlled experiments, where investigators assign sub- 
jects at random—by the toss of a coin—to a treatment group or to a control 
group. Up to random error, the coin balances the two groups with respect to 
all relevant factors other than treatment. Differences between the treatment 
group and the control group are therefore due to treatment. That is why causa- 
tion is relatively easy to infer from experimental data. However, experiments 
tend to be expensive, and may be impossible for ethical or practical reasons. 
Then statisticians turn to observational studies. 
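
A small simulation can make the balancing claim concrete. The sketch below is illustrative only: the subjects, the covariate (age), the sample size, and the seed are all made up, and the function name `randomize` is hypothetical. It assigns subjects by a fair coin toss and then compares the average age in the two groups.

```python
import random
import statistics

def randomize(subjects, rng):
    """Assign each subject to treatment or control by a fair coin toss."""
    treatment, control = [], []
    for s in subjects:
        (treatment if rng.random() < 0.5 else control).append(s)
    return treatment, control

rng = random.Random(0)  # fixed seed so the run is reproducible

# Hypothetical covariate: ages of 10,000 made-up subjects.
ages = [rng.uniform(40, 64) for _ in range(10_000)]
treatment, control = randomize(ages, rng)

mean_t = statistics.mean(treatment)
mean_c = statistics.mean(control)
print(f"treatment: n={len(treatment)}, mean age={mean_t:.2f}")
print(f"control:   n={len(control)}, mean age={mean_c:.2f}")
```

The two means should differ only by a small chance amount. The same holds for any other covariate, measured or not, because the coin does not look at the subjects.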

In an observational study, it is the subjects who assign themselves to 
the different groups. The investigators just watch what happens. Studies on 
the effects of smoking, for instance, are necessarily observational. However, 
the treatment-control terminology is still used. The investigators compare 
smokers (the treatment group, also called the exposed group) with nonsmokers 
(the control group) to determine the effect of smoking. The jargon is a little 
confusing, because the word “control” has two senses: 


(i) a control is a subject who did not get the treatment;
(ii) a controlled experiment is a study where the investigators decide 
who will be in the treatment group. 


Smokers come off badly in comparison with nonsmokers. Heart attacks, 
lung cancer, and many other diseases are more common among smokers. 
There is a strong association between smoking and disease. If cigarettes 
cause disease, that explains the association: death rates are higher for smokers 
because cigarettes kill. Generally, association is circumstantial evidence for 
causation. However, the proof is incomplete. There may be some hidden 
confounding factor which makes people smoke and also makes them sick. 
If so, there is no point in quitting: that will not change the hidden factor. 
Association is not the same as causation. 


Confounding means a difference between the treatment and con- 
trol groups—other than the treatment—which affects the response 
being studied. 


Typically, a confounder is a third variable which is associated with exposure 
and influences the risk of disease. 

Statisticians like Joseph Berkson and R. A. Fisher did not believe the 
evidence against cigarettes, and suggested possible confounding variables. 
Epidemiologists (including Richard Doll and Bradford Hill in England, as 
well as Wynder, Graham, Hammond, Horn, and Kahn in the United States) 
ran careful observational studies to show these alternative explanations were
not plausible. Taken together, the studies make a powerful case that smoking
causes heart attacks, lung cancer, and other diseases. If you give up smoking, 
you will live longer. 

Epidemiological studies often make comparisons separately for smaller 
and more homogeneous groups, assuming that within these groups, subjects 
have been assigned to treatment or control as if by randomization. For ex- 
ample, a crude comparison of death rates among smokers and nonsmokers 
could be misleading if smokers are disproportionately male, because men are 
more likely than women to have heart disease and cancer. Gender is there- 
fore a confounder. To control for this confounder—a third use of the word 
“control”—epidemiologists compared male smokers to male nonsmokers, 
and females to females. 

Age is another confounder. Older people have different smoking habits, 
and are more at risk for heart disease and cancer. So the comparison between 
smokers and nonsmokers was made separately by gender and age: for ex- 
ample, male smokers age 55-59 were compared to male nonsmokers in the 
same age group. This controls for gender and age. Air pollution would be 
a confounder, if air pollution causes lung cancer and smokers live in more 
polluted environments. To control for this confounder, epidemiologists made 
comparisons separately in urban, suburban, and rural areas. In the end, expla- 
nations for health effects of smoking in terms of confounders became very, 
very implausible. 
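
To see how controlling for a confounder by cross-tabulation works, here is a minimal sketch. The counts are invented for illustration and do not come from any of the smoking studies. In this made-up population, smokers are mostly male and men have higher death rates, so the crude comparison overstates the difference that appears within each gender.

```python
# Hypothetical counts, invented for illustration: (group size, deaths).
data = {
    ("male", "smoker"): (8000, 240),
    ("male", "nonsmoker"): (2000, 40),
    ("female", "smoker"): (2000, 30),
    ("female", "nonsmoker"): (8000, 80),
}

def rate_per_1000(size, deaths):
    return 1000 * deaths / size

def crude_rate(status):
    """Death rate for smokers or nonsmokers, ignoring gender."""
    size = sum(n for (_, s), (n, _) in data.items() if s == status)
    deaths = sum(d for (_, s), (_, d) in data.items() if s == status)
    return rate_per_1000(size, deaths)

# Crude comparison: all smokers versus all nonsmokers.
crude_ratio = crude_rate("smoker") / crude_rate("nonsmoker")

# Stratified comparison: smokers versus nonsmokers within each gender.
stratum_ratios = {}
for gender in ("male", "female"):
    smokers = rate_per_1000(*data[(gender, "smoker")])
    nonsmokers = rate_per_1000(*data[(gender, "nonsmoker")])
    stratum_ratios[gender] = smokers / nonsmokers

print(f"crude ratio of death rates: {crude_ratio:.2f}")
for gender, ratio in stratum_ratios.items():
    print(f"ratio among {gender}s: {ratio:.2f}")
```

With these invented numbers the crude ratio is larger than the ratio within either gender. The gap between the two is the part of the crude association that is due to gender rather than smoking.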

Of course, as we control for more and more variables this way, study 
groups get smaller and smaller, leaving more and more room for chance 
effects. This is a problem with cross-tabulation as a method for dealing with 
confounders, and a reason for using statistical models. Furthermore, most 
observational studies are less compelling than the ones on smoking. The 
following (slightly artificial) example illustrates the problem. 


Example 1. In cross-national comparisons, there is a striking correlation 
between the number of telephone lines per capita in a country and the death 
rate from breast cancer in that country. This is not because talking on the 
telephone causes cancer. Richer countries have more phones and higher 
cancer rates. The probable explanation for the excess cancer risk is that 
women in richer countries have fewer children. Pregnancy—especially early 
first pregnancy—is protective. Differences in diet and other lifestyle factors 
across countries may also play some role. 
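
The logic of example 1 can be imitated with a simulation. This is a sketch on made-up data, not an analysis of real countries: a single hypothetical variable `wealth` drives both `phones` and `cancer`, and neither of these affects the other. The adjustment uses residuals from a least-squares line, a technique taken up later in the book; the helper names `pearson` and `residuals` are assumptions for this example.

```python
import random

def pearson(x, y):
    """Correlation coefficient between two lists of equal length."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def residuals(y, x):
    """Residuals from the least-squares line of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return [b - my - slope * (a - mx) for a, b in zip(x, y)]

rng = random.Random(1)
n = 200  # made-up "countries"
wealth = [rng.gauss(0, 1) for _ in range(n)]
phones = [w + rng.gauss(0, 0.5) for w in wealth]  # depends on wealth only
cancer = [w + rng.gauss(0, 0.5) for w in wealth]  # depends on wealth only

r_crude = pearson(phones, cancer)
r_adjusted = pearson(residuals(phones, wealth), residuals(cancer, wealth))
print(f"correlation, phones and cancer: {r_crude:.2f}")
print(f"correlation after adjusting for wealth: {r_adjusted:.2f}")
```

The crude correlation is strong even though phones have no effect on cancer in the simulation. After adjusting for the confounder, the correlation is close to zero. In real data the confounders are seldom known so precisely, which is the difficulty.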


Randomized controlled experiments minimize the problem of confounding.
That is why causal inferences from randomized controlled experiments
are stronger than those from observational studies. With observational
studies of causation, you always have to worry about confounding. What
were the treatment and control groups? How were they different, apart
from treatment? What adjustments were made to take care of the
differences? Are these adjustments sensible?


The rest of this chapter will discuss examples: the HIP trial of mammography, 
Snow on cholera, and the causes of poverty. 


1.2 The HIP trial 


Breast cancer is one of the most common malignancies among women in 
Canada and the United States. If the cancer is detected early enough—before 
it spreads—chances of successful treatment are better. “Mammography” 
means screening women for breast cancer by X-rays. Does mammography 
speed up detection by enough to matter? The first large-scale randomized 
controlled experiment was HIP (Health Insurance Plan) in New York, followed 
by the Two-County study in Sweden. There were about half a dozen other 
trials as well. Some were negative (screening doesn’t help) but most were 
positive. By the late 1980s, mammography had gained general acceptance. 

The HIP study was done in the early 1960s. HIP was a group medical 
practice which had at the time some 700,000 members. Subjects in the experi- 
ment were 62,000 women age 40-64, members of HIP, who were randomized 
to treatment or control. “Treatment” consisted of invitation to 4 rounds of 
annual screening—a clinical exam and mammography. The control group 
continued to receive usual health care. Results from the first 5 years of fol- 
lowup are shown in table 1. In the treatment group, about 2/3 of the women 
accepted the invitation to be screened, and 1/3 refused. Death rates (per 1000 
women) are shown, so groups of different sizes can be compared. 


Table 1. HIP data. Group sizes (rounded), deaths in 5 years of
followup, and death rates per 1000 women randomized.


                 Group     Breast cancer     All other
                  size      No.    Rate      No.   Rate
Treatment
  Screened      20,200       23     1.1      428     21
  Refused       10,800       16     1.5      409     38
  Total         31,000       39     1.3      837     27

Control         31,000       63     2.0      879     28
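
The rate columns in table 1 can be recomputed from the group sizes and death counts. The sketch below does that arithmetic. It reads the screened row as 23 breast cancer deaths at a rate of 1.1 per 1000, which agrees with the treatment total of 39 = 23 + 16. The dictionary layout and the function name `per_1000` are choices made for this example.

```python
# (group size, breast cancer deaths, deaths from all other causes), table 1.
# The screened row is read as 23 deaths; 23 + 16 matches the total of 39.
hip = {
    "screened": (20_200, 23, 428),
    "refused": (10_800, 16, 409),
    "treatment total": (31_000, 39, 837),
    "control": (31_000, 63, 879),
}

def per_1000(deaths, size):
    """Death rate per 1000 women."""
    return 1000 * deaths / size

rates = {
    group: (round(per_1000(cancer, size), 1), round(per_1000(other, size)))
    for group, (size, cancer, other) in hip.items()
}

for group, (cancer_rate, other_rate) in rates.items():
    print(f"{group:16s} breast cancer {cancer_rate:4.1f}   other {other_rate:3d}")
```

Expressing deaths per 1000 women is what allows the screened group (20,200 women) to be compared with the refusers (10,800 women).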


Which rates show the efficacy of treatment? It seems natural to compare
those who accepted screening to those who refused. However, this is an ob- 
servational comparison, even though it occurs in the middle of an experiment. 
The investigators decided which subjects would be invited to screening, but 
it is the subjects themselves who decided whether or not to accept the invita- 
tion. Richer and better-educated subjects were more likely to participate than 
those who were poorer and less well educated. Furthermore, breast cancer 
(unlike most other diseases) hits the rich harder than the poor. Social status 
is therefore a confounder—a factor associated with the outcome and with the 
decision to accept screening. 

The tip-off is the death rate from other causes (not breast cancer) in the 
last column of table 1. There is a big difference between those who accept 
screening and those who refuse. The refusers have almost double the risk of 
those who accept. There must be other differences between those who accept 
screening and those who refuse, in order to account for the doubling in the 
risk of death from other causes—because screening has no effect on the risk. 

One major difference is social status. It is the richer women who come 
in for screening. Richer women are less vulnerable to other diseases but more 
vulnerable to breast cancer. So the comparison of those who accept screening 
with those who refuse is biased, and the bias is against screening. 

Comparing the death rate from breast cancer among those who accept 
screening and those who refuse is analysis by treatment received. This analy- 
sis is seriously biased, as we have just seen. The experimental comparison is 
between the whole treatment group—all those invited to be screened, whether 
or not they accepted screening—and the whole control group. This is the 
intention-to-treat analysis. 


Intention-to-treat is the recommended analysis. 


HIP, which was a very well-run study, made the intention-to-treat analysis. 
The investigators compared the breast cancer death rate in the total treatment 
group to the rate in the control group, and showed that screening works. 

The effect of the invitation is small in absolute terms: 63 - 39 = 24
lives saved (table 1). Since the absolute risk from breast cancer is small, no 
intervention can have a large effect in absolute terms. On the other hand, in 
relative terms, the 5-year death rates from breast cancer are in the ratio 39/63 = 
62%. Followup continued for 18 years, and the savings in lives persisted 
over that period. The Two-County study—a huge randomized controlled 
experiment in Sweden—confirmed the results of HIP. So did other studies 
in Finland, Scotland, and Sweden. That is why mammography became so 
widely accepted. 
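
The comparisons in this section reduce to a few lines of arithmetic on the counts in table 1. The sketch below sets the intention-to-treat comparison beside the comparison by treatment received, and also computes the "tip-off" ratio for deaths from other causes. Variable names are illustrative, and the screened row is read as 23 breast cancer deaths, consistent with the total of 39.

```python
# Counts from table 1: (group size, breast cancer deaths, other deaths).
screened = (20_200, 23, 428)
refused = (10_800, 16, 409)
invited = (31_000, 39, 837)   # whole treatment group: screened plus refused
control = (31_000, 63, 879)

def rate(deaths, size):
    """Deaths per 1000 women."""
    return 1000 * deaths / size

# Intention-to-treat: everyone invited versus everyone in control.
# The two groups have the same size, so death counts compare directly.
lives_saved = control[1] - invited[1]
itt_ratio = rate(invited[1], invited[0]) / rate(control[1], control[0])

# Treatment received: screened versus refused, an observational comparison.
received_ratio = rate(screened[1], screened[0]) / rate(refused[1], refused[0])

# The tip-off: death rates from other causes, which screening cannot affect.
other_ratio = rate(refused[2], refused[0]) / rate(screened[2], screened[0])

print(f"lives saved (intention-to-treat): {lives_saved}")
print(f"breast cancer rate, invited / control: {itt_ratio:.0%}")
print(f"breast cancer rate, screened / refused: {received_ratio:.0%}")
print(f"other-cause rate, refused / screened: {other_ratio:.2f}")
```

Only the first two figures come from the experimental comparison. The last two compare groups that selected themselves, so they reflect differences between the women as well as any effect of screening.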


1.3 Snow on cholera


A natural experiment is an observational study where assignment to 
treatment or control is as if randomized by nature. In 1855, some twenty 
years before Koch and Pasteur laid the foundations of modern microbiology, 
John Snow used a natural experiment to show that cholera is a waterborne 
infectious disease. At the time, the germ theory of disease was only one 
of many theories. Miasmas (foul odors, especially from decaying organic 
material) were often said to cause epidemics. Imbalance in the humors of the 
body—black bile, yellow bile, blood, phlegm—was an older theory. Poison 
in the ground was an explanation that came into vogue slightly later. 

Snow was a physician in London. By observing the course of the disease, 
he concluded that cholera was caused by a living organism which entered the 
body with water or food, multiplied in the body, and made the body expel 
water containing copies of the organism. The dejecta then contaminated food 
or reentered the water supply, and the organism proceeded to infect other 
victims. Snow explained the lag between infection and disease—a matter of 
hours or days—as the time needed for the infectious agent to multiply in the 
body of the victim. This multiplication is characteristic of life: inanimate 
poisons do not reproduce themselves. (Of course, poisons may take some 
time to do their work: the lag is not compelling evidence.) 

Snow developed a series of arguments in support of the germ theory. For 
instance, cholera spread along the tracks of human commerce. Furthermore, 
when a ship entered a port where cholera was prevalent, sailors contracted the 
disease only when they came into contact with residents of the port. These 
facts were easily explained if cholera was an infectious disease, but were hard 
to explain by the miasma theory. 

There was a cholera epidemic in London in 1848. Snow identified the 
first or “index” case in this epidemic: 


“a seaman named John Harnold, who had newly arrived by the Elbe 
steamer from Hamburgh, where the disease was prevailing.” [p. 3] 


He also identified the second case: a man named Blenkinsopp who took 
Harnold’s room after the latter died, and became infected by contact with the 
bedding. Next, Snow was able to find adjacent apartment buildings, one hard 
hit by cholera and one not. In each case, the affected building had a water 
supply contaminated by sewage, the other had relatively pure water. Again, 
these facts are easy to understand if cholera is an infectious disease—but not 
if miasmas are the cause. 

There was an outbreak of the disease in August and September of 1854. 
Snow made a “spot map,” showing the locations of the victims. These
clustered near the Broad Street pump. (Broad Street is in Soho, London; at the
time, public pumps were used as a source of drinking water.) By contrast, 
there were a number of institutions in the area with few or no fatalities. One 
was a brewery. The workers seemed to have preferred ale to water; if any 
wanted water, there was a private pump on the premises. Another institution 
almost free of cholera was a poor-house, which too had its own private pump. 
(Poor-houses will be discussed again, in section 4.) 

People in other areas of London did contract the disease. In most cases, 
Snow was able to show they drank water from the Broad Street pump. For 
instance, one lady in Hampstead so much liked the taste that she had water 
from the Broad Street pump delivered to her house by carter. 

So far, we have persuasive anecdotal evidence that cholera is an infec- 
tious disease, spread by contact or through the water supply. Snow also used 
statistical ideas. There were a number of water companies in the London of 
his time. Some took their water from heavily contaminated stretches of the 
Thames river. For others, the intake was relatively uncontaminated. 

Snow made “ecological” studies, correlating death rates from cholera in 
various areas of London with the quality of the water. Generally speaking, 
areas with contaminated water had higher death rates. The Chelsea water 
company was exceptional. This company started with contaminated water, 
but had quite modern methods of purification—with settling ponds and careful 
filtration. Its service area had a low death rate from cholera. 

In 1852, the Lambeth water company moved its intake pipe upstream 
to get purer water. The Southwark and Vauxhall company left its intake pipe 
where it was, in a heavily contaminated stretch of the Thames. Snow made 
an ecological analysis comparing the areas serviced by the two companies in 
the epidemics of 1853-54 and in earlier years. Let him now continue in his 
own words. 


“Although the facts shown in the above table [the ecological analysis] 
afford very strong evidence of the powerful influence which the drinking of 
water containing the sewage of a town exerts over the spread of cholera, when 
that disease is present, yet the question does not end here; for the intermixing 
of the water supply of the Southwark and Vauxhall Company with that of 
the Lambeth Company, over an extensive part of London, admitted of the 
subject being sifted in such a way as to yield the most incontrovertible proof 
on one side or the other. In the subdistricts enumerated in the above table 
as being supplied by both Companies, the mixing of the supply is of the 
most intimate kind. The pipes of each Company go down all the streets, 
and into nearly all the courts and alleys. A few houses are supplied by one 
Company and a few by the other, according to the decision of the owner or 
occupier at that time when the Water Companies were in active competition.
In many cases a single house has a supply different from that on either side.
Each company supplies both rich and poor, both large houses and small; 
there is no difference either in the condition or occupation of the persons 
receiving the water of the different Companies. Now it must be evident that, 
if the diminution of cholera, in the districts partly supplied with improved 
water, depended on this supply, the houses receiving it would be the houses 
enjoying the whole benefit of the diminution of the malady, whilst the houses 
supplied with the [contaminated] water from Battersea Fields would suffer 
the same mortality as they would if the improved supply did not exist at all. 
As there is no difference whatever in the houses or the people receiving the 
supply of the two Water Companies, or in any of the physical conditions 
with which they are surrounded, it is obvious that no experiment could have 
been devised which would more thoroughly test the effect of water supply 
on the progress of cholera than this, which circumstances placed ready made 
before the observer. 

“The experiment, too, was on the grandest scale. No fewer than three 
hundred thousand people of both sexes, of every age and occupation, and of 
every rank and station, from gentlefolks down to the very poor, were divided 
into groups without their choice, and in most cases, without their knowledge; 
one group being supplied with water containing the sewage of London, and 
amongst it, whatever might have come from the cholera patients; the other 
group having water quite free from such impurity. 

“To turn this grand experiment to account, all that was required was 
to learn the supply of water to each individual house where a fatal attack of 
cholera might occur.” [pp. 74-75] 


Snow’s data are shown in table 2. The denominator data—the number of 
houses served by each water company—were available from parliamentary 
records. For the numerator data, however, a house-to-house canvass was 
needed to determine the source of the water supply at the address of each 
cholera fatality. (The “bills of mortality,” as death certificates were called at
the time, showed the address but not the water source for each victim.) The 
death rate from the Southwark and Vauxhall water is about 9 times the death 
rate for the Lambeth water. Snow explains that the data could be analyzed as 


Table 2. Death rate from cholera by source of water. Rate per 
10,000 houses. London. Epidemic of 1854. Snow’s table IX. 


No. of Houses Cholera Deaths Rate per 10,000 


Southwark & Vauxhall 40,046 1,263 315 
Lambeth 26,107 98 37 
Rest of London 256,423 1,422 59 
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if they had resulted from a randomized controlled experiment: there was no 
difference between the customers of the two water companies, except for the 
water. The data analysis is simple—a comparison of rates. It is the design of 
the study and the size of the effect that compel conviction. 
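
The comparison of rates for the two companies can be checked by machine. The sketch below is in Python, which is our choice for illustration and not something the text prescribes; the variable and function names are ours too. It recomputes deaths per 10,000 houses from the counts in table 2 and takes the ratio. The last digit may differ slightly from the printed rates, which presumably reflects the rounding used in the original table.

```python
# Counts transcribed from table 2 (Snow's table IX), two companies only.
houses = {"Southwark & Vauxhall": 40046, "Lambeth": 26107}
deaths = {"Southwark & Vauxhall": 1263, "Lambeth": 98}


def rate_per_10000(n_deaths, n_houses):
    """Cholera deaths per 10,000 houses."""
    return 10000 * n_deaths / n_houses


rates = {name: rate_per_10000(deaths[name], houses[name]) for name in houses}
ratio = rates["Southwark & Vauxhall"] / rates["Lambeth"]

for name, rate in rates.items():
    print(f"{name}: {rate:.1f} deaths per 10,000 houses")
print(f"Ratio of rates: {ratio:.1f}")
```

The ratio comes out between 8 and 9, in line with the "about 9 times" quoted above.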


1.4 Yule on the causes of poverty 


Legendre (1805) and Gauss (1809) developed regression techniques to 
fit data on orbits of astronomical objects. The relevant variables were known 
from Newtonian mechanics, and so were the functional forms of the equations 
connecting them. Measurement could be done with high precision. Much 
was known about the nature of the errors in the measurements and equations. 
Furthermore, there was ample opportunity for comparing predictions to real- 
ity. A century later, investigators were using regression on social science data 
where these conditions did not hold, even to a rough approximation—with 
consequences that need to be explored (chapters 4-9). 

Yule (1899) was studying the causes of poverty. At the time, paupers 
in England were supported either inside grim Victorian institutions called 
“poor-houses” or outside, depending on the policy of local authorities. Did 
policy choices affect the number of paupers? To study this question, Yule 
proposed a regression equation, 


$$\Delta\text{Paup} = a + b \times \Delta\text{Out} + c \times \Delta\text{Old} + d \times \Delta\text{Pop} + \text{error}. \tag{1}$$


In this equation,


Δ is percentage change over time,
Paup is the percentage of paupers,
Out is the out-relief ratio N/D,
N = number on welfare outside the poor-house,
D = number inside,
Old is the percentage of the population aged over 65,
Pop is the population.

Data are from the English Censuses of 1871, 1881, 1891. There are two Δ’s,
one for 1871-81 and one for 1881-91. (Error terms will be discussed later.)

Relief policy was determined separately in each “union” (an administra- 
tive district comprising several parishes). At the time, there were about 600 
unions, and Yule divided them into four kinds: rural, mixed, urban, metropolitan.
There are 4 × 2 = 8 equations, one for each type of union and time period.
Yule fitted his equations to the data by least squares. That is, he determined 
a, b, c, and d by minimizing the sum of squared errors, 


$$\sum \left(\Delta\text{Paup} - a - b \times \Delta\text{Out} - c \times \Delta\text{Old} - d \times \Delta\text{Pop}\right)^2.$$

The sum is taken over all unions of a given type in a given time period, which
assumes (in effect) that coefficients are constant for those combinations of
geography and time.


Table 3. Pauperism, Out-relief ratio, Proportion of Old, Population.
Ratio of 1881 data to 1871 data, times 100. Metropolitan Unions,
England. Yule (1899, table XIX).

                        Paup    Out    Old    Pop

Kensington                27      5    104    136
Paddington                47     12    115    111
Fulham                    31     21     85    174
Chelsea                   64     21     81    124
St. George’s              46     18    113     96
Westminster               52     27    105     91
Marylebone                81     36    100     97
St. John, Hampstead       61     39    103    141
St. Pancras               61     35    101    107
Islington                 59     35    101    132
Hackney                   33     22     91    150
St. Giles’                76     30    103     85
Strand                    64     27     97     81
Holborn                   79     33     95     93
City                      79     64    113     68
Shoreditch                52     21    108    100
Bethnal Green             46     19    102    106
Whitechapel               35      6     93     93
St. George’s East         37      6     98     98
Stepney                   34     10     87    101
Mile End                  43     15    102    113
Poplar                    37     20    102    135
St. Saviour’s             52     22    100    111
St. Olave’s               57     32    102    110
Lambeth                   57     38     99    122
Wandsworth                23     18     91    168
Camberwell                30     14     83    168
Greenwich                 55     37     94    131
Lewisham                  41     24    100    142
Woolwich                  76     20    119    110
Croydon                   38     29    101    142
West Ham                  38     49     86    203

For example, consider the metropolitan unions. Fitting the equation to
the data for 1871-81, Yule got

$$\Delta\text{Paup} = 13.19 + 0.755\,\Delta\text{Out} - 0.022\,\Delta\text{Old} - 0.322\,\Delta\text{Pop} + \text{error}. \tag{2}$$

For 1881-91, his equation was

$$\Delta\text{Paup} = 1.36 + 0.324\,\Delta\text{Out} + 1.37\,\Delta\text{Old} - 0.369\,\Delta\text{Pop} + \text{error}. \tag{3}$$

The coefficient of ΔOut being relatively large and positive, Yule concludes
that out-relief causes poverty.

Let’s take a look at some of the details. Table 3 has the ratio of 1881
data to 1871 data for Pauperism, Out-relief ratio, Proportion of Old, and
Population. If we subtract 100 from each entry in the table, column 1 gives
ΔPaup in the regression equation (2); columns 2, 3, 4 give the other variables.
For Kensington (the first union in the table),

$$\Delta\text{Out} = 5 - 100 = -95, \quad \Delta\text{Old} = 104 - 100 = 4, \quad \Delta\text{Pop} = 136 - 100 = 36.$$

The predicted value for ΔPaup from (2) is therefore

$$13.19 + 0.755 \times (-95) - 0.022 \times 4 - 0.322 \times 36 = -70.$$

The actual value for ΔPaup is 27 − 100 = −73. So the error is −3. As noted
before, the coefficients were chosen by Yule to minimize the sum of squared
errors. (In chapter 4, we will see how to do this.)
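
The Kensington calculation, and the least squares fit itself, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Yule's procedure or the method of chapter 4. The language, the helper names (`predict`, `sse`, `solve`), and the use of the normal equations are our choices. The data are our transcription of table 3, and two smudged entries are read as 111 (Paddington, Pop) and 22 (St. Saviour’s, Out). If the transcription is right, the fitted coefficients should be in the neighborhood of those in equation (2), but rounding in the table means an exact match is not guaranteed.

```python
# Table 3: ratio of 1881 data to 1871 data, times 100, metropolitan unions.
# Columns: Paup, Out, Old, Pop. Transcribed by hand; treat as an assumption.
TABLE3 = [
    (27, 5, 104, 136), (47, 12, 115, 111), (31, 21, 85, 174), (64, 21, 81, 124),
    (46, 18, 113, 96), (52, 27, 105, 91), (81, 36, 100, 97), (61, 39, 103, 141),
    (61, 35, 101, 107), (59, 35, 101, 132), (33, 22, 91, 150), (76, 30, 103, 85),
    (64, 27, 97, 81), (79, 33, 95, 93), (79, 64, 113, 68), (52, 21, 108, 100),
    (46, 19, 102, 106), (35, 6, 93, 93), (37, 6, 98, 98), (34, 10, 87, 101),
    (43, 15, 102, 113), (37, 20, 102, 135), (52, 22, 100, 111), (57, 32, 102, 110),
    (57, 38, 99, 122), (23, 18, 91, 168), (30, 14, 83, 168), (55, 37, 94, 131),
    (41, 24, 100, 142), (76, 20, 119, 110), (38, 29, 101, 142), (38, 49, 86, 203),
]

# Yule's published coefficients from equation (2): intercept, Out, Old, Pop.
YULE_1871_81 = (13.19, 0.755, -0.022, -0.322)


def predict(coef, d_out, d_old, d_pop):
    """Predicted percentage change in pauperism from an equation like (2)."""
    a, b, c, d = coef
    return a + b * d_out + c * d_old + d * d_pop


# Kensington is the first row; subtract 100 to get percentage changes.
paup, out, old, pop = TABLE3[0]
predicted = predict(YULE_1871_81, out - 100, old - 100, pop - 100)
actual = paup - 100
error = actual - predicted
print(f"Kensington: predicted {predicted:.1f}, actual {actual}, error {error:.1f}")

# Response and design matrix (a column of 1's for the intercept).
y = [row[0] - 100 for row in TABLE3]
X = [[1.0, row[1] - 100, row[2] - 100, row[3] - 100] for row in TABLE3]


def sse(coef):
    """Sum of squared errors for the coefficients (a, b, c, d)."""
    return sum((yi - sum(cj * xj for cj, xj in zip(coef, row))) ** 2
               for row, yi in zip(X, y))


def solve(A, rhs):
    """Solve a square linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    M = [row[:] + [v] for row, v in zip(A, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(M[k][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for k in range(col + 1, n):
            factor = M[k][col] / M[col][col]
            for j in range(col, n + 1):
                M[k][j] -= factor * M[col][j]
    sol = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * sol[j] for j in range(i + 1, n))
        sol[i] = (M[i][n] - tail) / M[i][i]
    return sol


# Minimizing the sum of squares leads to the normal equations (X'X) beta = X'y.
XtX = [[sum(row[i] * row[j] for row in X) for j in range(4)] for i in range(4)]
Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(4)]
beta = solve(XtX, Xty)
residuals = [yi - sum(bj * xj for bj, xj in zip(beta, row)) for row, yi in zip(X, y)]

print("fitted a, b, c, d:", [round(v, 3) for v in beta])
print(f"sum of squared errors: fitted {sse(beta):.1f}, Yule's (2) {sse(YULE_1871_81):.1f}")
```

By construction, the fitted coefficients give a sum of squared errors no larger than any other choice, including the rounded coefficients printed in (2).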

Look back at equation (2). The causal interpretation of the coefficient
0.755 is this. Other things being equal, if ΔOut is increased by 1 percentage
point (the administrative district supports more people outside the
poor-house), then ΔPaup will go up by 0.755 percentage points. This is a
quantitative inference. Out-relief causes an increase in pauperism: a qualitative
inference. The point of introducing ΔPop and ΔOld into the equation is to
control for possible confounders, implementing the idea of “other things being
equal.” For Yule’s argument, it is important that the coefficient of ΔOut
be significantly positive. Qualitative inferences are often the important ones;
with regression, the two aspects are woven together.

Quetelet (1835) wanted to uncover “social physics” —the laws of human 
behavior—by using statistical technique. Yule was using regression to infer 
the social physics of poverty. But this is not so easily to be done. Confounding 
is one problem. According to Pigou, a leading welfare economist of Yule’s 
era, districts with more efficient administrations were building poor-houses 
and reducing poverty. Efficiency of administration is then a confounder,
influencing both the presumed cause and its effect. Economics may be another 
confounder. Yule occasionally describes the rate of population change as a 
proxy for economic growth. Generally, however, he pays little attention to 
economics. The explanation: 


“A good deal of time and labour was spent in making trial of this idea, but 
the results proved unsatisfactory, and finally the measure was abandoned 
altogether.” [p. 253] 


The form of Yule’s equation is somewhat arbitrary, and the coefficients 
are not consistent across time and geography: compare equations (2) and (3) 
to see differences across time. Differences across geography are reported 
in table C of Yule’s paper. The inconsistencies may not be fatal. However, 
unless the coefficients have some existence of their own—apart from the 
data—how can they predict the results of interventions that would change the 
data? The distinction between parameters and estimates is a basic one, and 
we will return to this issue several times in chapters 4-9. 

There are other problems too. At best, Yule has established association. 
Conditional on the covariates, there is a positive association between ΔPaup
and ΔOut. Is this association causal? If so, which way do the causal arrows
point? For instance, a parish may choose not to build poor-houses in response 
to a short-term increase in the number of paupers, in which case pauperism 
causes out-relief. Likewise, the number of paupers in one area may well be 
affected by relief policy in neighboring areas. Such issues are not resolved 
by the data analysis. Instead, answers are assumed a priori. Yule’s enterprise 
is substantially more problematic than Snow on cholera, or the HIP trial, or 
the epidemiology of smoking. 

Yule was aware of the problems. Although he was busily parceling out 
changes in pauperism—so much is due to changes in the out-relief ratio, so 
much to changes in other variables, and so much to random effects—there 
is one deft footnote (number 25) that withdraws all causal claims: “Strictly 
speaking, for ‘due to’ read ‘associated with.’” 

Yule’s approach is strikingly modern, except there is no causal diagram 
with stars to indicate statistical significance. Figure 1 brings him up to date. 
The arrow from ΔOut to ΔPaup indicates that ΔOut is included in the regression
equation explaining ΔPaup. “Statistical significance” is indicated by an
asterisk, and three asterisks signal a high degree of significance. The idea
is that a statistically significant coefficient differs from zero, so that ΔOut
has a causal influence on ΔPaup. By contrast, an insignificant coefficient is
considered to be zero: e.g., ΔOld does not have a causal influence on ΔPaup.
We return to these issues in chapter 6. 

Figure 1. Yule’s model. Metropolitan unions, 1871-81.

Yule could have used regression to summarize his data: for a given time
period and unions of a specific type, with certain values of the explanatory
variables, the change in pauperism was about so much and so much. In other
words, he could have used his equations to approximate the average value of
ΔPaup, given the values of ΔOut, ΔOld, ΔPop. This assumes linearity. If we
turn to prediction, there is another assumption: the system will remain stable 
over time. Prediction is already more complicated than description. On the 
other hand, if we make a series of predictions and test them against data, it 
may be possible to show that the system is stable enough for regression to 
be helpful. 

Causal inference is different, because a change in the system is contem- 
plated—an intervention. Descriptive statistics tell you about the data that you 
happen to have. Causal models claim to tell you what will happen to some 
of the numbers if you intervene to change other numbers. This is a claim 
worth examining. Something has to remain constant amidst the changes. 
What is this, and why is it constant? Chapters 4 and 5 will explain how to 
fit regression equations like (2) and (3). Chapter 6 discusses some examples 
from contemporary social science, and examines the constancy-in-the-midst- 
of-changes assumptions that justify causal inference by statistical models. 
Response schedules will be used to formalize the constancy assumptions. 


Exercise set A


1. In the HIP trial (table 1), what is the evidence confirming that treatment
has no effect on death from other causes?

2. Someone wants to analyze the HIP data by comparing the women who 
accept screening to the controls. Is this a good idea? 

3. Was Snow’s study of the epidemic of 1853-54 (table 2) a randomized 
controlled experiment or a natural experiment? Why does it matter that 
the Lambeth company moved its intake point in 1852? Explain briefly. 

4. Was Yule’s study a randomized controlled experiment or an observational
study?


5. In equation (2), suppose the coefficient of ΔOut had been −0.755. What
would Yule have had to conclude? If the coefficient had been +0.005?


Exercises 6-8 prepare for the next chapter. If the material is unfamiliar, you 
might want to read chapters 16-18 in Freedman-Pisani-Purves (2007), or 
similar material in another text. Keep in mind that 


variance = (standard error)².


6. Suppose $X_1, X_2, \ldots, X_n$ are independent random variables, with common
expectation $\mu$ and variance $\sigma^2$. Let $S_n = X_1 + X_2 + \cdots + X_n$.
Find the expectation and variance of $S_n$. Repeat for $S_n/n$.


7. Suppose $X_1, X_2, \ldots, X_n$ are independent random variables, with a common
distribution: $P(X_i = 1) = p$ and $P(X_i = 0) = 1 - p$, where
$0 < p < 1$. Let $S_n = X_1 + X_2 + \cdots + X_n$. Find the expectation and
variance of $S_n$. Repeat for $S_n/n$.


8. What is the law of large numbers? 
9. Keefe et al (2001) summarize their data as follows: 


“Thirty-five patients with rheumatoid arthritis kept a diary for 30 
days. The participants reported having spiritual experiences, such 
as a desire to be in union with God, on a frequent basis. On days that 
participants rated their ability to control pain using religious coping 
methods as high, they were much less likely to have joint pain.” 


Does the study show that religious coping methods are effective at con- 
trolling joint pain? If not, how would you explain the data? 


10. According to many textbooks, association is not causation. To what 
extent do you agree? Discuss briefly. 


1.5 End notes for chapter 1 


Experimental design is a topic in itself. For instance, many experiments 
block subjects into relatively homogeneous groups. Within each group, some 
are chosen at random for treatment, and the rest serve as controls. Blinding is 
another important topic. Of course, experiments can go off the rails. For one 
example, see EC/IC Bypass Study Group (1985), with commentary by Sundt 
(1987) and others. The commentary makes the case that management and 
reporting of this large multi-center surgery trial broke down, with the result 
that many patients likely to benefit from surgery were operated on outside the 
trial and excluded from tables in the published report. 

Epidemiology is the study of medical statistics. More formally, epidemiology
is “the study of the distribution and determinants of health-related
states or events in specified populations and the application of this study to 
control of health problems.” See Last (2001, p. 62) and Gordis (2004, p. 3). 


Health effects of smoking. See Cornfield et al (1959), International 
Agency for Research on Cancer (1986). For a brief summary, see Freedman 
(1999). There have been some experiments on smoking cessation, but these 
are inconclusive at best. Likewise, animal experiments can be done, but there 
are difficulties in extrapolating from one species to another. Critical commen- 
tary on the smoking hypothesis includes Berkson (1955) and Fisher (1959). 
The latter makes arguments that are almost perverse. (Nobody’s perfect.) 


Telephones and breast cancer. The correlation is 0.74 with 165 coun- 
tries. Breast cancer death rates (age standardized) are from 


http://www-dep.iarc.fr/globocan/globocan.html 
Population figures, counts of telephone lines (and much else) are available at 
http://www.cia.gov/cia/publications/factbook 


HIP. The best source is Shapiro et al (1988). The actual randomiza- 
tion mechanism involved list sampling. The differentials in table 1 persist 
throughout the 18-year followup period, and are more marked if we take cases 
incident during the first 7 years of followup, rather than 5. Screening ended 
after 4 or 5 years and it takes a year or two for the effect to be seen, so 7 years 
is probably the better time period to use. 

Intention-to-treat measures the effect of assignment, not the effect of 
screening. The effect of screening is diluted by crossover—only 2/3 of the 
women came in for screening. When there is crossover from the treatment 
arm to the control arm, but not the reverse, it is straightforward to correct 
for dilution. The effect of screening is to reduce the death rate from breast 
cancer by a factor of 2. This estimate is confirmed by results from the Two- 
County study. See Freedman et al (2004) for a review; correcting for dilution 
is discussed there, on p. 72; also see Freedman (2006b). 

Subjects in the treatment group who accepted screening had a much 
lower death rate from all causes other than breast cancer (table 1). Why? 
For one thing, the compliers were richer and better educated; mortality rates 
decline as income and education go up. Furthermore, the compliers probably 
took better care of themselves in general. See section 2.2 in Freedman-Pisani- 
Purves (2007); also see Petitti (1994). 

Recently, questions about the value of mammography have again been 
raised, but the evidence from the screening trials is quite solid. For reviews, 
see Smith (2003) and Freedman et al (2004). 

Snow on cholera. At the end of the 19th century, there was a burst of
activity in microbiology. In 1878, Pasteur published La théorie des germes et 
ses applications à la médecine et à la chirurgie. Around that time, Pasteur and 
Koch isolated the anthrax bacillus and developed techniques for vaccination. 
The tuberculosis bacillus was next. In 1883, there were cholera epidemics in 
Egypt and India, and Koch isolated the vibrio (prior work by Filippo Pacini 
in 1854 had been forgotten). 

There was another epidemic in Hamburg in 1892. The city fathers turned 
to Max von Pettenkofer, a leading figure in the German hygiene movement of 
the time. He did not believe Snow’s theory, holding instead that cholera was 
caused by poison in the ground. Hamburg was a center of the slaughterhouse 
industry: von Pettenkofer had the carcasses of dead animals dug up and hauled 
away, in order to reduce pollution of the ground. The epidemic continued 
until the city lost faith in von Pettenkofer, and turned in desperation to Koch. 

References on the history of cholera include Rosenberg (1962), Howard- 
Jones (1975), Evans (1987), Winkelstein (1995). Today, the molecular biol- 
ogy of the cholera vibrio is reasonably well understood. There are surveys by 
Colwell (1996) and Raufman (1998). For a synopsis, see Alberts et al (1994, 
pp. 484, 738). For valuable detail on Snow’s work, see Vinten-Johansen et al 
(2003). Also see http://www.ph.ucla.edu/epi/snow.html.

In the history of epidemiology, there are many examples like Snow’s 
work on cholera. For instance, Semmelweis (1860) discovered the cause 
of puerperal fever. There is a lovely book by Loudon (2000) that tells the 
history, although Semmelweis could perhaps have been treated a little more
gently. Around 1914, to mention another example, Goldberger showed that 
pellagra was the result of a diet deficiency. Terris (1964) reprints many of 
Goldberger’s articles; also see Carpenter (1981). The history of beriberi 
research is definitely worth reading (Carpenter, 2000). 


Quetelet. A few sentences will indicate the flavor of his enterprise. 


“In giving my work the title of Social Physics, I have had no other aim 
than to collect, in a uniform order, the phenomena affecting man, nearly as 
physical science brings together the phenomena appertaining to the material 
world. ... in a given state of society, resting under the influence of certain 
causes, regular effects are produced, which oscillate, as it were, around a 
fixed mean point, without undergoing any sensible alterations. . . . 

“This study ... has too many attractions—it is connected on too many 
sides with every branch of science, and all the most interesting questions in 
philosophy—to be long without zealous observers, who will endeavour to 
carry it farther and farther, and bring it more and more to the appearance of 
a science.” (Quetelet 1842, pp. vii, 103) 

Yule. The “errors” in (1) and (2) play different roles in the theory. In (1),
we have random errors which are unobservable parts of a statistical model. In 
(2), we have residuals which can be computed as part of model fitting; (3) is 
like (2). Details are in chapter 4. For sympathetic accounts of the history, see 
Stigler (1986) and Desrosières (1993). Meehl (1954) provides some well-known
examples of success in prediction by regression. Predictive validity
is best demonstrated by making real “ex ante”—before the fact—forecasts 
in several different contexts: predicting the future is a lot harder than fitting 
regression equations to the past (Ehrenberg and Bound 1993). 


John Stuart Mill. The contrast between experiment and observation 
goes back to Mill (1843), as does the idea of confounding. (In the seventh 
edition, see Book III, Chapters VII and X, esp. pp. 423 and 503.) 


Experiments vs observational studies. Fruits-and-vegetables epidemi- 
ology is a well-known case where experiments contradict observational data. 
In brief, the observational data say that people who eat a vitamin-rich diet 
get cancer at lower rates, “so” vitamins prevent cancer. The experiments say 
that vitamin supplements either don’t help or actually increase the risk. 

The problem with the observational studies is that people who eat (for 
example) five servings of fruit and vegetables every day are different from 
the rest of us in many other ways. It is hard to adjust for all these differences 
by purely statistical methods (Freedman-Pisani-Purves, 2007, p. 26 and note 
23 on p. A6). Research papers include Clarke and Armitage (2002), Virtamo 
et al (2003), Lawlor et al (2004), Cook et al (2007). Hercberg et al (2004) 
get a positive effect for men not women. 

Hormone replacement therapy (HRT) is another example (Petitti 1998, 
2002). The observational studies say that HRT prevents heart disease in 
women, after menopause. The experiments show that HRT has no benefit. 
The women who chose HRT were different from other women, in ways that 
the observational studies missed. We will discuss HRT again in chapter 7. 

Ioannidis (2005) shows that by comparison with experiments, across a 
variety of interventions, observational studies are much less likely to give 
results which can be replicated. Also see Kunz and Oxman (1998). 

Anecdotal evidence—based on individual cases, without a systematic 
comparison of different groups—is a weak basis for causal inference. If there 
is no control group in a study, considerable skepticism is justified, especially 
if the effect is small or hard to measure. When the effect is dramatic, as 
with penicillin for wound infection, these statistical caveats can be set aside. 
On penicillin, see Goldsmith (1946), Fleming (1947), Hare (1970), Walsh 
(2003). Smith and Pell (2004) have a good—and brutally funny—discussion 
of causal inference when effects are large. 

The Regression Line


2.1 Introduction 


This chapter is about the regression line. The regression line is important 
on its own (to statisticians), and it will help us with multiple regression in 
chapter 4. The first example is a scatter diagram showing the heights of 1078 
fathers and their sons (figure 1). Each pair of fathers and sons becomes a dot 
on the diagram. The height of the father is plotted on the x-axis; the height of 
his son, on the y-axis. The left hand vertical strip (inside the chimney) shows 
the families where the father is 64 inches tall to the nearest inch; the right hand 
vertical strip, families where the father is 72 inches tall. Many other strips 
could be drawn too. The regression line approximates the average height of 
the sons, given the heights of their fathers. This line goes through the centers 
of all the vertical strips. The regression line is flatter than the SD line, which 
is dashed. “SD” is shorthand for “standard deviation”; definitions come next. 


2.2 The regression line 


We have n subjects indexed by $i = 1, \ldots, n$, and two data variables x
and y. A data variable stores a value for each subject in a study. Thus, $x_i$ is
the value of x for subject i, and $y_i$ is the value of y. In figure 1, a “subject”
is a family: $x_i$ is the height of the father in family i, and $y_i$ is the height of
the son. For Yule (section 1.4), a “subject” might be a metropolitan union,
with $x_i$ = ΔOut for union i, and $y_i$ = ΔPaup.

Figure 1. Heights of fathers and sons. Pearson and Lee (1903).
[Scatter diagram: father’s height (inches) on the x-axis, son’s height (inches) on the y-axis.]


The regression line is computed from five summary statistics: (i) the
average of x, (ii) the SD of x, (iii) the average of y, (iv) the SD of y, and
(v) the correlation between x and y. The calculations can be organized as
follows, with “variance” abbreviated to “var”; the formulas for $\bar{y}$ and var(y)
are omitted.

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \operatorname{var}(x) = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2, \tag{1}$$

$$\text{the SD of } x \text{ is } s_x = \sqrt{\operatorname{var} x}, \tag{2}$$

$$x_i \text{ in standard units is } z_i = \frac{x_i - \bar{x}}{s_x}, \tag{3}$$

and the correlation coefficient is

$$r = \frac{1}{n}\sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \cdot \frac{y_i - \bar{y}}{s_y} \right). \tag{4}$$


We’re tacitly assuming $s_x \neq 0$ and $s_y \neq 0$. Necessarily, $-1 \le r \le 1$: see
exercise B16 below. The correlation between x and y is often written as
r(x, y). Let sign(r) = +1 when r > 0 and sign(r) = −1 when r < 0. The
regression line is flatter than the SD line, by (5) and (6) below.


(5) The regression line of y on x goes through the point of averages
$(\bar{x}, \bar{y})$. The slope is $r s_y / s_x$. The intercept is $\bar{y} - \text{slope} \cdot \bar{x}$.


(6) The SD line also goes through the point of averages. The slope
is $\operatorname{sign}(r)\, s_y / s_x$. The intercept is $\bar{y} - \text{slope} \cdot \bar{x}$.
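
To see the recipe in action, here is a short Python sketch that computes the five summary statistics and then the two lines in (5) and (6). The data are hypothetical (five made-up pairs, chosen so the arithmetic is easy to follow by hand), and the function names are ours. Variances use the divisor n, as in formula (1).

```python
import math

# Hypothetical paired data, for illustration only (not from the book).
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]


def mean(values):
    return sum(values) / len(values)


def var(values):
    """Variance with divisor n, as in formula (1)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def sd(values):
    return math.sqrt(var(values))


def correlation(xs, ys):
    """Average product of the variables in standard units, as in formula (4)."""
    mx, my, sx, sy = mean(xs), mean(ys), sd(xs), sd(ys)
    return sum(((a - mx) / sx) * ((b - my) / sy) for a, b in zip(xs, ys)) / len(xs)


r = correlation(x, y)
sign_r = 1 if r > 0 else -1  # the text defines sign(r) only for r != 0

# (5) the regression line of y on x
reg_slope = r * sd(y) / sd(x)
reg_intercept = mean(y) - reg_slope * mean(x)

# (6) the SD line
sd_slope = sign_r * sd(y) / sd(x)
sd_intercept = mean(y) - sd_slope * mean(x)

print(f"r = {r:.2f}")
print(f"regression line: y = {reg_intercept:.2f} + {reg_slope:.2f} x")
print(f"SD line:         y = {sd_intercept:.2f} + {sd_slope:.2f} x")
```

For these numbers, both averages are 3, both SDs are √2, and r = 0.8. The regression slope is 0.8 and the SD line has slope 1, so the regression line is the flatter of the two, as claimed.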


Figure 2. Graph of averages. The dots show the average height 
of the sons, for each value of father’s height. The regression line 
(solid) follows the dots: it is flatter than the SD line (dashed). 

The regression of y on x, also called the regression line for predicting y from
x, is a linear approximation to the graph of averages, which shows the average 
value of y for each x (figure 2). 

Correlation is a key concept. Figure 3 shows the correlation coefficient 
for three scatter diagrams. All the diagrams have the same number of points 
(n = 50), the same means ($\bar{x} = \bar{y} = 50$), and the same SDs ($s_x = s_y = 15$).
The shapes are very different. The correlation coefficient r tells you about the 
shapes. (If the variables aren’t paired—two numbers for each subject—you 
won’t be able to compute the correlation coefficient or regression line.) 


Figure 3. Three scatter diagrams. The correlation measures the 
extent to which the scatter diagram is packed in around a line. If 
the sign is positive, the line slopes up. If the sign is negative, the line
slopes down (not shown here). 

If you use the line y = a + bx to predict y from x, the error or residual
for subject i is $e_i = y_i - a - b x_i$, and the MSE is

$$\frac{1}{n}\sum_{i=1}^{n} e_i^2.$$

The RMS error is the square root of the MSE. For the regression line, as will
be seen later, the MSE equals $(1 - r^2) \operatorname{var} y$. The abbreviations: MSE stands
for mean square error; RMS, for root mean square.


A THEOREM DUE TO C.-F. Gauss. Among all lines, the regression line 
has the smallest MSE. 
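
A numerical check (not a proof) may make the theorem and the MSE formula more concrete. The Python sketch below uses the same kind of hypothetical data as before, made up for illustration. It computes the MSE of the regression line, compares it with $(1 - r^2) \operatorname{var} y$, and then tries a small grid of nearby lines to see that none does better. The grid of perturbations is arbitrary, and the names are ours.

```python
import math

# Hypothetical paired data, for illustration only (not from the book).
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
n = len(x)


def mse(a, b):
    """Mean square error of the line y = a + b*x on the data above."""
    return sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y)) / n


x_bar, y_bar = sum(x) / n, sum(y) / n
var_x = sum((xi - x_bar) ** 2 for xi in x) / n
var_y = sum((yi - y_bar) ** 2 for yi in y) / n
# Average product of deviations; dividing by the two SDs gives r as in (4).
cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n
r = cov_xy / math.sqrt(var_x * var_y)

slope = r * math.sqrt(var_y / var_x)
intercept = y_bar - slope * x_bar
best = mse(intercept, slope)

# Nudge the intercept and slope; no nearby line should beat the regression line.
nearby = [mse(intercept + da, slope + db)
          for da in (-0.5, -0.1, 0.0, 0.1, 0.5)
          for db in (-0.5, -0.1, 0.0, 0.1, 0.5)]

print(f"MSE of regression line: {best:.4f}")
print(f"(1 - r^2) var y:        {(1 - r ** 2) * var_y:.4f}")
print(f"smallest MSE on grid:   {min(nearby):.4f}")
```

Here r = 0.8 and var y = 2, so both expressions for the MSE equal 0.72, and the smallest MSE on the grid is attained at the unperturbed line.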


A more general theorem will be proved in chapter 3. If the material in 
sections 1—2 is unfamiliar, you might want to read chapters 8—12 in Freedman- 
Pisani-Purves (2007). 


2.3 Hooke’s law


A weight is hung on the end of a spring whose length under no load is a. 
The spring stretches to a new length. According to Hooke’s law, the amount 
of stretch is proportional to the weight. If you hang weight $x_i$ on the spring,
the length is

$$Y_i = a + b x_i + \epsilon_i, \quad \text{for } i = 1, \ldots, n. \tag{7}$$

Equation (7) is a regression model. In this equation, a and b are constants that
depend on the spring. The values are unknown, and have to be estimated from
data. These are parameters. The $\epsilon_i$ are independent, identically distributed,
mean 0, variance $\sigma^2$. These are random errors, or disturbances. The variance
$\sigma^2$ is another parameter. You choose $x_i$, the weight on occasion i. The
response $Y_i$ is the length of the spring under the load. You do not see a, b, or
the $\epsilon_i$.

Table 1 shows the results of an experiment on Hooke’s law, done in a 
physics class at U.C. Berkeley. The first column shows the load. The second 
column shows the measured length. (The “spring” was a long piece of piano 
wire hung from the ceiling of a big lecture hall.) 


Table 1. An experiment on Hooke’s law.

Weight (kg)    Length (cm)

     0           439.00
     2           439.12
     4           439.21
     6           439.31
     8           439.40
    10           439.50



We use the method of least squares to estimate the parameters a and b.
In other words, we fit the regression line. The intercept is

$$\hat{a} \doteq 439.01 \text{ cm}.$$

A hat over a parameter denotes an estimate: we estimate a as 439.01 cm. The
slope is

$$\hat{b} \doteq 0.05 \text{ cm per kg}.$$

We estimate b as 0.05 cm per kg. (The dotted equals sign “≐” means nearly
equal; there is roundoff error in the numerical results.)

There are two conclusions. (i) Putting a weight on the spring makes
it longer. (ii) Each extra kilogram of weight makes the spring about 0.05 
centimeters longer. The first is a (pretty obvious) qualitative inference; the 
second is quantitative. The distinction between qualitative and quantitative 
inference will come up again in chapter 6. 
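
The fit for the spring can be reproduced with a few lines of Python. This is a sketch under one assumption that should be stated plainly: the loads are taken to be 0, 2, 4, 6, 8, 10 kg, paired in order with the six measured lengths in table 1. The formulas are the usual least squares ones for a line (equivalent to statement (5) in section 2.2), and the variable names are ours.

```python
# Table 1: load in kg (assumed to be 0, 2, ..., 10) and measured length in cm.
weight = [0, 2, 4, 6, 8, 10]
length = [439.00, 439.12, 439.21, 439.31, 439.40, 439.50]
n = len(weight)

x_bar = sum(weight) / n
y_bar = sum(length) / n

# Least squares slope and intercept for the line: length = a + b * weight.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(weight, length))
s_xx = sum((x - x_bar) ** 2 for x in weight)
b_hat = s_xy / s_xx
a_hat = y_bar - b_hat * x_bar

print(f"a_hat = {a_hat:.2f} cm, b_hat = {b_hat:.2f} cm per kg")
```

With these inputs the estimates round to 439.01 cm and 0.05 cm per kg, matching the values quoted in the text.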


Exercise set A 
1. In the Pearson-Lee data, the average height of the fathers was 67.7 inches; 
the SD was 2.74 inches. The average height of the sons was 68.7 inches; 
the SD was 2.81 inches. The correlation was 0.501. 


(a) True or false and explain: because the sons average an inch taller 
than the fathers, if the father is 72 inches tall, it’s 50-50 whether the 
son is taller than 73 inches. 


(b) Find the regression line of son’s height on father’s height, and its 
RMS error. 


2. Can you determine a in equation (7) by measuring the length of the 
spring with no load? With one measurement? Ten measurements? Ex- 
plain briefly. 

3. Use the data in table 1 to find the MSE and the RMS error for the 
regression line predicting length from weight. Which statistic gives a 
better sense of how far the data are from the regression line? Hint: keep 
track of the units, or plot the data, or both. 


4. The correlation coefficient is a good descriptive statistic for one of the 
three diagrams below. Which one, and why? 


2.4 Complexities 


Compare equation (7) with equation (8):

$$Y_i = a + b x_i + \epsilon_i, \tag{7}$$

$$Y_i = \hat{a} + \hat{b} x_i + e_i. \tag{8}$$

Looks the same? Take another look. In the regression model (7), we can’t
see the parameters a, b or the disturbances $\epsilon_i$. In the fitted model (8), the
estimates $\hat{a}$, $\hat{b}$ are observable, and so are the residuals $e_i$. With a large sample,
$\hat{a} \doteq a$ and $\hat{b} \doteq b$, so $e_i \doteq \epsilon_i$. But

$$\doteq \;\ne\; =$$

The $e_i$ in (8) is called a residual rather than a disturbance term or random
error term. Often, $e_i$ is called an “error,” although this can be confusing.
“Residual” is clearer.


Estimates aren’t parameters, and residuals aren’t random errors. 


The $Y_i$ in (7) are random variables, because the $\epsilon_i$ are random. How are
random variables connected to data? The answer, which involves observed 
values, will be developed by example. The examples will also show how 
ideas of mean and variance can be extended from data to random variables— 
with some pointers on going back and forth between the two realms. We 
begin with the mean. Consider the list {1, 2, 3, 4, 5, 6}. This has mean 3.5 
and variance 35/12, by formula (1). So far, we have a tiny data set. Random 
variables are coming next. 

Throw a die n times. (A die has six faces, all equally likely; one face
has 1 spot, another face has 2 spots, and so forth, up to 6.) Let $U_i$ be the
number of spots on the ith roll, for $i = 1, \ldots, n$. The $U_i$ are (better, are
modeled as) independent, identically distributed random variables, like choosing
numbers at random from the list {1, 2, 3, 4, 5, 6}. Each random variable has
mean (expectation, aka expected value) equal to 3.5, and variance equal to
35/12. Here, mean and variance have been applied to a random variable: the
number of spots when you throw a die.

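The sample mean and the sample variance are

$$\bar{U} = \frac{1}{n}\sum_{i=1}^{n} U_i \quad \text{and} \quad \operatorname{var}\{U_1, \ldots, U_n\} = \frac{1}{n}\sum_{i=1}^{n} (U_i - \bar{U})^2. \tag{9}$$

The sample mean and variance in (9) are themselves random variables. In
principle, they differ from $E(U_i)$ and $\operatorname{var}(U_i)$, which are fixed numbers: the
expectation and variance, respectively, of $U_i$. When n is large,

$$\bar{U} \doteq E(U_i) = 3.5, \qquad \operatorname{var}\{U_1, \ldots, U_n\} \doteq \operatorname{var}(U_i) = 35/12. \tag{10}$$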


That is how the expectation and variance of a random variable are estimated 
from repeated observations. 
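
A simulation shows the approximation in (10) at work. The Python sketch below first computes the mean and variance of the list {1, 2, 3, 4, 5, 6} exactly, then rolls a simulated die and reports the sample mean and sample variance for increasing n. The pseudo-random generator, the seed, and the sample sizes are arbitrary choices of ours; a different seed gives different observed values.

```python
import random
from fractions import Fraction

faces = [1, 2, 3, 4, 5, 6]

# Mean and variance of the list, using divisor n as in formula (1).
list_mean = Fraction(sum(faces), len(faces))
list_var = sum((Fraction(f) - list_mean) ** 2 for f in faces) / len(faces)
print(f"list mean = {list_mean}, list variance = {list_var}")


def sample_mean_and_var(rolls):
    """Sample mean and sample variance (divisor n), as in (9)."""
    n = len(rolls)
    m = sum(rolls) / n
    v = sum((u - m) ** 2 for u in rolls) / n
    return m, v


rng = random.Random(0)  # fixed seed so the run is repeatable; the seed is arbitrary
for n in (100, 10_000, 100_000):
    rolls = [rng.choice(faces) for _ in range(n)]
    m, v = sample_mean_and_var(rolls)
    print(f"n = {n:>7,}: sample mean = {m:.3f}, sample variance = {v:.3f}")
```

The exact values are 7/2 and 35/12. The printed sample values are observed values of the random variables in (9); as n grows they should settle near 3.5 and 35/12 ≐ 2.917.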
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e Random variables have means; so do data sets. 
e Random variables have variances; so do data sets. 
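
The two realms can be put side by side in a short simulation. The sketch below is illustrative only: the seed, the number of rolls, and the helper names `mean` and `variance` are assumptions, not part of the text. It computes the mean and variance of the list {1, ..., 6} exactly, then compares them with the sample mean and sample variance of many simulated rolls.

```python
import random
from fractions import Fraction

faces = [1, 2, 3, 4, 5, 6]

def mean(values):
    """Average of a list of numbers."""
    return sum(values) / len(values)

def variance(values):
    """Mean squared deviation from the average (divide by n, not n - 1)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

# Exact mean and variance of the list; these are also E(U_i) and var(U_i).
exact_mean = mean([Fraction(f) for f in faces])      # 7/2
exact_var = variance([Fraction(f) for f in faces])   # 35/12

# Observed values of U_1, ..., U_n for a simulated die (seed chosen arbitrarily).
rng = random.Random(0)
n = 100_000
rolls = [rng.choice(faces) for _ in range(n)]
sample_mean = mean(rolls)
sample_var = variance(rolls)

print(exact_mean, exact_var)
print(round(sample_mean, 3), round(sample_var, 3))
```

The first line printed describes the random variable; the second describes one particular data set, and changes with the seed.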


The discussion has been a little abstract. Now someone actually throws 
the die n = 100 times. That generates some data. The total number of spots 
is 371. The average number of spots per roll is 371/100 = 3.71. This is not
$\bar U$, but the observed value of $\bar U$. After all, $\bar U$ has a probability distribution:
3.71 just sits there. Similarly, the measurements on the spring in Hooke’s
law (table 1) aren’t random variables. According to the regression model, the
lengths are observed values of the random variables $Y_i$ defined by (7).


In a regression model, as a rule, the data are observed values of 
random variables. 


Now let’s revisit (8). If (7) holds, the $\hat a$, $\hat b$, and $e_i$ in (8) can be viewed
as observable random variables or as observed values, depending on context.
Sometimes, observed values are called realizations. Thus, 439.01 cm is a
realization of the random variable $\hat a$.

There is one more issue to take up. Variance is often used to measure 
spread. However, as the next example shows, variance usually has the wrong 
units and the wrong size: take the square root to get the SD. 


Example 1. American men age 18-24 have an average weight of 170
lbs. The typical person in this group weighs around 170 lbs, but will not
weigh exactly 170 lbs. The typical deviation from average is ______. The
variance of weight is 900 square pounds: wrong units, wrong size. Do not
put variance into the blank. The SD is $\sqrt{\text{variance}}$ = 30 lbs. The typical
deviation from average weight is something like 30 lbs.


Example 2. Roll a die 100 times. Let $S = X_1 + \cdots + X_{100}$ be the total
number of spots. This is a random variable, with $E(S) = 100 \times 3.5 = 350$.
You will get around 350 spots, give or take ______ or so. The variance of $S$
is $100 \times 35/12 \doteq 292$. (The 35/12 is the variance of the list {1, 2, 3, 4, 5, 6},
as mentioned earlier.) Do not put 292 into the blank. To use the variance,
take the square root. The SE (standard error) is $\sqrt{292} \doteq 17$. Put 17 into
the blank. (The SE applies to random variables; the SD, to data.)


The number of spots will be around 350, but will be off 350 by something 
like 17. The number of spots is unlikely to be more than two or three SEs 
away from its expected value. For random variables, the standard error is the 
square root of the variance. (The standard error of a random variable is often 
called its standard deviation, which can be confusing.) 
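
The arithmetic in example 2 can be checked in a few lines. This is a sketch; the variable names are made up for illustration. It uses exact fractions for the expected value and variance, and takes the square root only at the end.

```python
import math
from fractions import Fraction

n = 100
faces = [1, 2, 3, 4, 5, 6]

mu = Fraction(sum(faces), len(faces))                      # 7/2
var_one = sum((f - mu) ** 2 for f in faces) / len(faces)   # 35/12

expected_sum = n * mu        # expected value of S
var_sum = n * var_one        # variance of S: wrong units, wrong size
se_sum = math.sqrt(var_sum)  # standard error of S: this goes into the blank

print(expected_sum, float(var_sum), round(se_sum, 2))
```

The variance comes out near 292 and the SE near 17, matching the example.
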


2.5 Simple vs multiple regression


A simple regression equation has on the right hand side an intercept 
and an explanatory variable with a slope coefficient. A multiple regression 
equation has several explanatory variables on the right hand side, each with 
its own slope coefficient. To study multiple regression, we will need matrix 
algebra. That is covered in chapter 3. 


Exercise set B 


1. In equation (1), variance applies to data, or random variables? What 
about correlation in (4)? 

2. On page 22, below table 1, you will find the number 439.01. Is this a 
parameter or an estimate? What about the 0.05? 

3. Suppose we didn’t have the last line in table 1. Find the regression of 
length on weight, based on the data in the first 5 lines of the table. 

4. In example 1, is 900 square pounds the variance of a random variable? 
or of data? Discuss briefly. 

5. In example 2, is 35/12 the variance of a random variable? of data? 
maybe both? Discuss briefly. 

6. A die is rolled 180 times. Find the expected number of aces, and the
variance for the number of aces. The number of aces will be around
______, give or take ______ or so. (A die has six faces, all
equally likely; the face with one spot is the “ace.”)

7. A die is rolled 250 times. The fraction of times it lands ace will be
around ______, give or take ______ or so.

8. One hundred draws are made at random with replacement from the box
[1] [2] [2] [5]. The draws come out as follows: 17 [1]s, 54 [2]s,
and 29 [5]s. Fill in the blanks.

(a) For the ______, the observed value is 0.8 SEs above the expected
value. (Reminder: SE = standard error.)
(b) For the ______, the observed value is 1.33 SEs above the
expected value.

Options (two will be left over):
number of 1’s    number of 2’s    number of 5’s    sum of the draws


If exercises 6—8 cover unfamiliar material, you might want to read chapters 
16-18 in Freedman-Pisani-Purves (2007), or similar material in another text. 


9. Equation (7) is a ______. Options:
model    parameter    random variable


10. In equation (7), $a$ is ______. Options (more than one may be right):
observable    unobservable    a parameter    a random variable
Repeat for $b$. For $\epsilon_i$. For $Y_i$.


11. According to equation (7), the 439.00 in table 1 is ______. Options:


a parameter 
a random variable 
the observed value of a random variable 


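12. Suppose $x_1, \ldots, x_n$ are real numbers. Let $\bar x = (x_1 + \cdots + x_n)/n$. Let
$c$ be a real number.
(a) Show that $\sum_{i=1}^n (x_i - \bar x) = 0$.
(b) Show that $\sum_{i=1}^n (x_i - c)^2 = \left[\sum_{i=1}^n (x_i - \bar x)^2\right] + n(\bar x - c)^2$.
Hint: $(x_i - c) = (x_i - \bar x) + (\bar x - c)$.
(c) Show that $\sum_{i=1}^n (x_i - c)^2$, as a function of $c$, has a unique minimum
at $c = \bar x$.
(d) Show that $\sum_{i=1}^n x_i^2 = \left[\sum_{i=1}^n (x_i - \bar x)^2\right] + n\bar x^2$.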


13. A statistician has a sample, and is computing the sum of the squared 
deviations of the sample numbers from a number q. The sum of the 
squared deviations will be smallest when q is the ______. Fill in the
blank (25 words or less) and explain. 


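14. Suppose $x_1, \ldots, x_n$ and $y_1, \ldots, y_n$ have means $\bar x$, $\bar y$; the standard
deviations are $s_x > 0$, $s_y > 0$; and the correlation is $r$. Let

$$\mathrm{cov}(x, y) = \frac{1}{n}\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y).$$

(“cov” is shorthand for covariance.) Show that:

(a) $\mathrm{cov}(x, y) = r s_x s_y$.

(b) The slope of the regression line for predicting $y$ from $x$ is $\mathrm{cov}(x, y)/\mathrm{var}(x)$.

(c) $\mathrm{var}(x) = \mathrm{cov}(x, x)$.

(d) $\mathrm{cov}(x, y) = \overline{xy} - \bar x\,\bar y$.

(e) $\mathrm{var}(x) = \overline{x^2} - \bar x^2$.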

15. Suppose $x_1, \ldots, x_n$ and $y_1, \ldots, y_n$ are real numbers, with $s_x > 0$ and
$s_y > 0$. Let $x^*$ be $x$ in standard units; similarly for $y$. Show that
$r(x, y) = r(x^*, y^*)$.

16. Suppose $x_1, \ldots, x_n$ and $y_1, \ldots, y_n$ are real numbers, with $\bar x = \bar y = 0$
and $s_x = s_y = 1$. Show that $\frac{1}{n}\sum_{i=1}^n (x_i + y_i)^2 = 2(1 + r)$ and
$\frac{1}{n}\sum_{i=1}^n (x_i - y_i)^2 = 2(1 - r)$, where $r = r(x, y)$. Show that

$$-1 \le r \le 1.$$


17. A die is rolled twice. Let $X_i$ be the number of spots on the ith roll for
$i = 1, 2$.
(a) Find $P(X_1 = 3 \mid X_1 + X_2 = 8)$, the conditional probability of a 3
on the first roll given a total of 8 spots.
(b) Find $P(X_1 + X_2 = 7 \mid X_1 = 3)$.
(c) Find $E(X_1 \mid X_1 + X_2 = 6)$, the conditional expectation of $X_1$ given
that $X_1 + X_2 = 6$.


18. (Hard.) Suppose $x_1, \ldots, x_n$ are real numbers. Suppose $n$ is odd and the
$x_i$ are all distinct. There is a unique median $\mu$: the middle number when
the $x$’s are arranged in increasing order. Let $c$ be a real number. Show
that $f(c) = \sum_{i=1}^n |x_i - c|$, as a function of $c$, is minimized when $c = \mu$.
Hints. You can’t do this by calculus, because $f$ isn’t differentiable.
Instead, show that $f(c)$ is (i) continuous, (ii) strictly increasing as $c$
increases for $c > \mu$, i.e., $\mu < c_1 < c_2$ implies $f(c_1) < f(c_2)$, and
(iii) strictly decreasing as $c$ increases for $c < \mu$. It’s easier to think
about claims (ii) and (iii) when $c$ differs from all the $x$’s. You may as
well assume that the $x_i$ are increasing with $i$. If you pursue this line
of reasoning far enough, you will find that $f$ is linear between the $x$’s,
with corners at the $x$’s. Moreover, $f$ is convex, i.e.,
$f[(x + y)/2] \le [f(x) + f(y)]/2$.


Comment. If $-f$ is convex, then $f$ is said to be concave.


2.6 End notes for chapter 2 


In (6), if $r = 0$, you can take the slope of the SD line to be either $s_y/s_x$
or $-s_y/s_x$. In other applications, however, sign(0) is usually defined as 0.


Hooke’s law (7) is a good approximation when the weights are relatively
small. When the weights are larger, a quadratic term may be needed. Close to
the “elastic limit” of the spring, things get more complicated. Experimental
details were simplified. For data sources, see pp. A11, A14 in
Freedman-Pisani-Purves (2007).


For additional material on random variables, including the connection
between physical dice and mathematical models for dice, see


http://www.stat.berkeley.edu/users/census/rv.pdf


Matrix Algebra


3.1 Introduction 


Matrix algebra is the key to multiple regression (chapter 4), so we review 
the basics here. Section 4 covers positive definite matrices, with a quick 
introduction to the normal distribution in section 5. A matrix is a rectangular 
array of numbers. (In this book, we only consider matrices of real numbers.) 
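For example, $M$ is a $3 \times 2$ matrix (3 rows, 2 columns) and $b$ is a $2 \times 1$
column vector:

$$M = \begin{pmatrix} 3 & -1 \\ 2 & -1 \\ -1 & 4 \end{pmatrix}, \qquad b = \begin{pmatrix} 3 \\ -3 \end{pmatrix}.$$

The ijth element of $M$ is written $M_{ij}$, e.g., $M_{32} = 4$; similarly, $b_2 = -3$.
Matrices can be multiplied (element-wise) by a scalar. Matrices of the same
size can be added (again, element-wise). For instance,

$$2 \times \begin{pmatrix} 3 & -1 \\ 2 & -1 \\ -1 & 4 \end{pmatrix} = \begin{pmatrix} 6 & -2 \\ 4 & -2 \\ -2 & 8 \end{pmatrix}, \qquad \begin{pmatrix} 3 & -1 \\ 2 & -1 \\ -1 & 4 \end{pmatrix} + \begin{pmatrix} 3 & 2 \\ 4 & -1 \\ -1 & 1 \end{pmatrix} = \begin{pmatrix} 6 & 1 \\ 6 & -2 \\ -2 & 5 \end{pmatrix}.$$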




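An $m \times n$ matrix $A$ can be multiplied by a matrix $B$ of size $n \times p$. The
product is an $m \times p$ matrix, whose ikth element is $\sum_j A_{ij} B_{jk}$. For example,

$$Mb = \begin{pmatrix} 3 \times 3 + (-1) \times (-3) \\ 2 \times 3 + (-1) \times (-3) \\ (-1) \times 3 + 4 \times (-3) \end{pmatrix} = \begin{pmatrix} 12 \\ 9 \\ -15 \end{pmatrix}.$$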


Matrix multiplication is not commutative. This may seem tricky at first, but 
you get used to it. Exercises 1—2 (below) provide a little more explanation. 
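
The multiplication rule translates directly into code. The sketch below stores a matrix as a list of rows; the function name `matmul` and the matrices `P` and `Q` are made up for illustration. It reproduces the product of $M$ and $b$ from the text, then shows two square matrices whose products differ when the order is reversed.

```python
def matmul(A, B):
    """Product of an m x n matrix A and an n x p matrix B (lists of rows)."""
    n = len(B)
    if any(len(row) != n for row in A):
        raise ValueError("columns of A must match rows of B")
    p = len(B[0])
    # The ik-th element is the sum over j of A[i][j] * B[j][k].
    return [[sum(A[i][j] * B[j][k] for j in range(n)) for k in range(p)]
            for i in range(len(A))]

M = [[3, -1], [2, -1], [-1, 4]]   # 3 x 2
b = [[3], [-3]]                   # 2 x 1
print(matmul(M, b))               # [[12], [9], [-15]]

# Order matters: for these 2 x 2 matrices, PQ and QP differ.
P = [[1, 2], [5, 3]]
Q = [[0, 1], [1, 0]]
print(matmul(P, Q))               # [[2, 1], [3, 5]]
print(matmul(Q, P))               # [[5, 3], [1, 2]]
```
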
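The matrix $0_{m \times n}$ is an $m \times n$ matrix all of whose entries are 0. For
instance,

$$0_{2 \times 3} = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}.$$

The $m \times m$ identity matrix is written $I_m$ or $I_{m \times m}$. This matrix has 1’s on the
diagonal and 0’s off the diagonal:

$$I_{3 \times 3} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}.$$

If $A$ is $m \times n$, then $I_{m \times m} \times A = A = A \times I_{n \times n}$.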
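An $m \times n$ matrix $A$ can be “transposed.” The result is an $n \times m$ matrix
denoted $A'$ or $A^T$. For example,

$$\begin{pmatrix} 3 & -1 \\ 2 & -1 \\ -1 & 4 \end{pmatrix}' = \begin{pmatrix} 3 & 2 & -1 \\ -1 & -1 & 4 \end{pmatrix}.$$

If $A' = A$, then $A$ is symmetric.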

If $u$ and $v$ are $n \times 1$ column vectors, the inner product or dot product is
$u \cdot v = u' \times v$. If this is 0, then $u$ and $v$ are orthogonal: we write $u \perp v$. The
norm or length of $u$ is $\|u\|$, where $\|u\|^2 = u \cdot u$. People often write $|u|$ instead
of $\|u\|$. The inner product $u \cdot v$ equals the length of $u$, times the length of $v$,
times the cosine of the angle between the two vectors. If $u \perp v$, the angle is
90°, and cos(90°) = 0.
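
A small numerical illustration of inner products, lengths, and angles follows. The vectors and the helper names `dot`, `norm`, and `angle_degrees` are hypothetical, chosen only so that the arithmetic is easy to follow.

```python
import math

def dot(u, v):
    """Inner product of two vectors stored as flat lists."""
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    """Length of u: the square root of u . u."""
    return math.sqrt(dot(u, u))

def angle_degrees(u, v):
    """Angle between u and v, from u . v = |u| |v| cos(angle)."""
    return math.degrees(math.acos(dot(u, v) / (norm(u) * norm(v))))

u = [1, 2, 2]
v = [2, 1, -2]
print(dot(u, v))             # 0, so u and v are orthogonal
print(norm(u))               # 3.0
print(angle_degrees(u, v))   # close to 90
```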

For square matrices, the trace is the sum of the diagonal elements:
$\mathrm{trace}(A) = \sum_i A_{ii}$.


Exercise set A


1. Suppose $A$ is $m \times n$ and $B$ is $n \times p$. For $i$ and $j$ with $1 \le i \le m$ and
$1 \le j \le p$, let $r_i$ be the ith row of $A$ and let $c_j$ be the jth column of $B$.
What is the size of $r_i$? of $c_j$? How is $r_i \times c_j$ related to the ijth element
of $A \times B$?


2. Suppose $A$ is $m \times n$, while $u$, $v$ are $n \times 1$ and $\alpha$ is scalar. Show that
$Au \in R^m$, $A(\alpha u) = \alpha Au$, and $A(u + v) = Au + Av$. As they say, $A$
is a linear map from $R^n$ to $R^m$, where $R^n$ is n-dimensional Euclidean
space. (For instance, $R^1$ is the line and $R^2$ is the plane.)


3. If $A$ is $m \times n$, check that $A + 0_{m \times n} = 0_{m \times n} + A = A$.


For exercises 4 and 5, let

$$M = \begin{pmatrix} 3 & -1 \\ 2 & -1 \\ 1 & -4 \end{pmatrix}.$$

4. Show that $I_{3 \times 3} M = M = M I_{2 \times 2}$.
5. Compute $M'M$ and $MM'$. Find the trace of $M'M$ and the trace of $MM'$.


6. Find the lengths of $u$ and $v$, defined below. Are these vectors orthogonal?
Compute the outer product $u \times v'$. What is the trace of the outer product?


-O 


3.2 Determinants and inverses 


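Matrix inversion will be needed to get regression estimates and their
standard errors. One way to find inverses begins with determinants. The
determinant of a square matrix is computed by an inductive procedure:

$$\det(4) = 4, \qquad \det\begin{pmatrix} 1 & 2 \\ 5 & 3 \end{pmatrix} = (1 \times 3) - (2 \times 5) = -7,$$

$$\det\begin{pmatrix} 1 & 2 & 3 \\ 2 & 3 & 1 \\ 0 & 1 & 1 \end{pmatrix} = 1 \times \det\begin{pmatrix} 3 & 1 \\ 1 & 1 \end{pmatrix} - 2 \times \det\begin{pmatrix} 2 & 3 \\ 1 & 1 \end{pmatrix} + 0 \times \det\begin{pmatrix} 2 & 3 \\ 3 & 1 \end{pmatrix}$$

$$= 1 \times (3 - 1) - 2 \times (2 - 3) + 0 \times (2 - 9) = 4.$$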


Here, we work our way down the first column, getting the determinants of
the smaller matrices obtained by striking out the row and column through each
current position. The determinants pick up extra signs, which alternate +
and −. The determinants with the extra signs tacked on are called cofactors.
With a 4×4 matrix, for instance, the extra signs are

$$\begin{pmatrix} + & - & + & - \\ - & + & - & + \\ + & - & + & - \\ - & + & - & + \end{pmatrix}.$$

The determinant of a matrix is $\sum_{i=1}^n a_{i1} c_{i1}$, where $a_{ij}$ is the ijth element in
the matrix, and $c_{ij}$ is the cofactor. (Watch it: the determinants have signs of
their own, as well as the extra signs shown above.) It turns out that you can
use any row or column, not just column 1, for computing the determinant. As
a matter of notation, people often write $|A|$ instead of $\det(A)$.
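
The inductive procedure is naturally recursive. The sketch below expands down the first column, as in the text; the function names `minor` and `det` are illustrative choices, and the approach is meant for small matrices only, since the work grows very quickly with the size of the matrix.

```python
def minor(A, i, j):
    """Matrix A (a list of rows) with row i and column j struck out."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(A) if k != i]

def det(A):
    """Determinant by expansion down the first column."""
    if len(A) == 1:
        return A[0][0]
    # (-1) ** i supplies the extra signs +, -, +, ... for column 1.
    return sum((-1) ** i * A[i][0] * det(minor(A, i, 0)) for i in range(len(A)))

print(det([[4]]))                               # 4
print(det([[1, 2], [5, 3]]))                    # -7
print(det([[1, 2, 3], [2, 3, 1], [0, 1, 1]]))   # 4
```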

Let $v_1, v_2, \ldots, v_k$ be $n \times 1$ vectors. These are linearly independent if
$c_1 v_1 + c_2 v_2 + \cdots + c_k v_k = 0_{n \times 1}$ implies $c_1 = \cdots = c_k = 0$. The rank of a
matrix is the number of linearly independent columns (or rows; the number has
to be the same). If $n \ge p$, an $n \times p$ matrix $X$ has full rank if the rank is $p$; otherwise,
$X$ is rank deficient. An $n \times n$ matrix $A$ has full rank if and only if $\det(A) \ne 0$.
Then the matrix has an inverse $A^{-1}$:

$$A \times A^{-1} = A^{-1} \times A = I_{n \times n}.$$

Such matrices are invertible or non-singular. The inverse is unique; this
follows from existence. Conversely, if $A$ is invertible, then $\det(A) \ne 0$ and
the rank of $A$ is $n$.

The inverse can be computed as follows:

$$A^{-1} = \mathrm{adj}(A)/\det(A),$$

where adj($A$) is the transpose of the matrix of cofactors. (This is the classical
adjoint.) For example,

$$\mathrm{adj}\begin{pmatrix} 1 & 2 \\ 5 & 3 \end{pmatrix} = \begin{pmatrix} 3 & -2 \\ -5 & 1 \end{pmatrix}.$$
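
The adjoint formula can be tried out on a small matrix. The sketch below is self-contained and uses exact fractions so that the inverse has no rounding error; the function names are illustrative assumptions. In practice, numerical software computes inverses by other methods, so this is a way to check the definitions rather than a recommended algorithm.

```python
from fractions import Fraction

def minor(A, i, j):
    """Matrix A (a list of rows) with row i and column j struck out."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(A) if k != i]

def det(A):
    """Determinant by expansion down the first column."""
    if len(A) == 1:
        return A[0][0]
    return sum((-1) ** i * A[i][0] * det(minor(A, i, 0)) for i in range(len(A)))

def adjugate(A):
    """Classical adjoint: the transpose of the matrix of cofactors."""
    n = len(A)
    if n == 1:
        return [[1]]
    cof = [[(-1) ** (i + j) * det(minor(A, i, j)) for j in range(n)]
           for i in range(n)]
    return [list(col) for col in zip(*cof)]

def inverse(A):
    """Inverse of an integer matrix as adj(A)/det(A), in exact fractions."""
    d = det(A)
    if d == 0:
        raise ValueError("matrix is singular")
    return [[Fraction(entry, d) for entry in row] for row in adjugate(A)]

A = [[1, 2], [5, 3]]
print(adjugate(A))   # [[3, -2], [-5, 1]]
print(inverse(A))    # entries -3/7, 2/7, 5/7, -1/7
```
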


Exercise set B


For exercises 1–7 below, let

$$A = \begin{pmatrix} 1 & 2 \\ 5 & 3 \end{pmatrix}, \qquad B = \begin{pmatrix} 1 & 2 & 3 \\ 2 & 3 & 1 \\ 0 & 1 & 1 \end{pmatrix}, \qquad C = \begin{pmatrix} 1 & 2 \\ 2 & 4 \\ 3 & 6 \end{pmatrix}.$$

1. Find adj($B$). This is just to get on top of the definitions; later, we do all
this sort of thing on the computer.

2. Show that $A \times \mathrm{adj}\,A = \mathrm{adj}\,A \times A = \det(A) \times I_n$. Repeat, for $B$. What is
$n$ in each case?

3. Find the rank and the trace of $A$. Repeat, for $B$.

4. Find the rank of $C$.

5. If possible, find the trace and determinant of $C$. If not, why not?

6. If possible, find $A^2$. If not, why not? (Hint: $A^2 = A \times A$.)

7. If possible, find $C^2$. If not, why not?

8. If $M$ is $m \times n$ and $N$ is $m \times n$, show that $(M + N)' = M' + N'$.

9. Suppose $M$ is $m \times n$ and $N$ is $n \times p$.
(a) Show that $(MN)' = N'M'$.
(b) Suppose $m = n = p$, and $M$, $N$ are both invertible. Show that
$(MN)^{-1} = N^{-1}M^{-1}$ and $(M')^{-1} = (M^{-1})'$.

10. Suppose $X$ is $n \times p$ with $p \le n$. If $X$ has rank $p$, show that $X'X$ has rank
$p$, and conversely. Hints. Suppose $X$ has rank $p$ and $c$ is $p \times 1$. Then
$X'Xc = 0_{p \times 1} \Rightarrow c'X'Xc = 0 \Rightarrow \|Xc\|^2 = 0 \Rightarrow Xc = 0_{n \times 1}$.
Notes. The matrix $X'X$ is $p \times p$. The rank is $p$ if and only if $X'X$ is
invertible. The $\Rightarrow$ is shorthand for “implies.”

11. If $A$ is $m \times n$ and $B$ is $n \times m$, show that trace($AB$) = trace($BA$). Hint:
the iith element of $AB$ is $\sum_j A_{ij} B_{ji}$, while the jjth element of $BA$ is
$\sum_i B_{ji} A_{ij}$.

12. If $u$ and $v$ are $n \times 1$, show that $\|u + v\|^2 = \|u\|^2 + \|v\|^2 + 2u \cdot v$.

13. If $u$ and $v$ are $n \times 1$, show that $\|u + v\|^2 = \|u\|^2 + \|v\|^2$ if and only if
$u \perp v$. (This is Pythagoras’ theorem in $n$ dimensions.)


14. Suppose $X$ is $n \times p$ with rank $p < n$. Suppose $Y$ is $n \times 1$. Let
$\hat\beta = (X'X)^{-1}X'Y$ and $e = Y - X\hat\beta$.

(a) Show that $X'X$ is $p \times p$, while $X'Y$ is $p \times 1$.

(b) Show that $X'X$ is symmetric. Hint: look at exercise 9(a).

(c) Show that $X'X$ is invertible. Hint: look at exercise 10.

(d) Show that $(X'X)^{-1}$ is $p \times p$, so $\hat\beta = (X'X)^{-1}X'Y$ is $p \times 1$.

(e) Show that $(X'X)^{-1}$ is symmetric. Hint: look at exercise 9(b).

(f) Show that $X\hat\beta$ and $e = Y - X\hat\beta$ are $n \times 1$.

(g) Show that $X'X\hat\beta = X'Y$, and hence $X'e = 0_{p \times 1}$.

(h) Show that $e \perp X\hat\beta$, so $\|Y\|^2 = \|X\hat\beta\|^2 + \|e\|^2$.

(i) If $\gamma$ is $p \times 1$, show that $\|Y - X\gamma\|^2 = \|Y - X\hat\beta\|^2 + \|X(\hat\beta - \gamma)\|^2$.
Hint: $Y - X\gamma = Y - X\hat\beta + X(\hat\beta - \gamma)$.

(j) Show that $\|Y - X\gamma\|^2$ is minimized when $\gamma = \hat\beta$.

(k) If $\tilde\beta$ is $p \times 1$ with $Y - X\tilde\beta \perp X$, show that $\tilde\beta = \hat\beta$. Notation: $v \perp X$
if $v$ is orthogonal to each column of $X$. Hint: what is $X'(Y - X\tilde\beta)$?

(l) Is $XX'$ invertible? Hints. By assumption, $p < n$. Can you find an
$n \times 1$ vector $c \ne 0_{n \times 1}$ with $c'X = 0_{1 \times p}$?

(m) Is $(X'X)^{-1} = X^{-1}(X')^{-1}$?

Notes. The “OLS estimator” is $\hat\beta$, where OLS is shorthand for “ordinary
least squares.” This exercise develops a lot of the theory for
OLS estimators. The geometry in brief: $X'e = 0_{p \times 1}$ means that $e$ is
orthogonal (perpendicular) to each column of $X$. Hence $\hat Y = X\hat\beta$
is the projection of $Y$ onto the columns of $X$, and the closest point in
column space to $Y$. Part (j) is Gauss’ theorem for multiple regression.


15. In exercise 14, suppose $p = 1$, so $X$ is a column vector. Show that
$\hat\beta = X \cdot Y/\|X\|^2$.

16. In exercise 14, suppose $p = 1$ and $X$ is a column of 1’s. Show that
$\hat\beta$ is the mean of the $Y$’s. How is this related to exercise 2B12(c), i.e.,
part (c), exercise 12, set B, chapter 2?

17. This exercise explains a stepwise procedure for computing $\hat\beta$ in exercise 14.
There are hints, but there is also some work to do. Let $M$ be the
first $p - 1$ columns of $X$, so $M$ is $n \times (p - 1)$. Let $N$ be the last column
of $X$, so $N$ is $n \times 1$.

(i) Let $\hat\gamma_1 = (M'M)^{-1}M'Y$ and $f = Y - M\hat\gamma_1$.
(ii) Let $\hat\gamma_2 = (M'M)^{-1}M'N$ and $g = N - M\hat\gamma_2$.
(iii) Let $\hat\gamma_3 = f \cdot g/\|g\|^2$ and $e = f - g\hat\gamma_3$.

Show that $e \perp X$. (Hint: begin by checking $f \perp M$ and $g \perp M$.)
Finally, show that

$$\hat\beta = \begin{pmatrix} \hat\gamma_1 - \hat\gamma_2\hat\gamma_3 \\ \hat\gamma_3 \end{pmatrix}.$$

Note. The procedure amounts to (i) regressing $Y$ on $M$, (ii) regressing
$N$ on $M$, then (iii) regressing the first set of residuals on the second.


18. Suppose $u$, $v$ are $n \times 1$; neither is identically 0. What is the rank of $u \times v'$?


3.3 Random vectors


Let $U = \begin{pmatrix} U_1 \\ U_2 \\ U_3 \end{pmatrix}$, a $3 \times 1$ column vector of random variables. Then
$E(U) = \begin{pmatrix} E(U_1) \\ E(U_2) \\ E(U_3) \end{pmatrix}$, a $3 \times 1$ column vector of numbers. On the other hand,
cov($U$) is a $3 \times 3$ matrix of real numbers:

$$\mathrm{cov}(U) = E\left[ \begin{pmatrix} U_1 - E(U_1) \\ U_2 - E(U_2) \\ U_3 - E(U_3) \end{pmatrix} \begin{pmatrix} U_1 - E(U_1) & U_2 - E(U_2) & U_3 - E(U_3) \end{pmatrix} \right].$$

Here, cov applies to random vectors, not to data (“cov” is shorthand for
covariance). The same definitions can be used for vectors of any size.
People sometimes use correlations for random variables: the correlation
between $U_1$ and $U_2$, for instance, is $\mathrm{cov}(U_1, U_2)/\sqrt{\mathrm{var}(U_1)\,\mathrm{var}(U_2)}$.
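
To make the definitions concrete, here is a hypothetical example that is not in the text: roll two fair dice, and let $U$ have components $X_1$, $X_2$, and $X_1 + X_2$. The sketch enumerates the 36 equally likely outcomes and computes $E(U)$ and cov($U$) exactly from the definitions above.

```python
from fractions import Fraction
from itertools import product

# Hypothetical example: two fair dice, U = (X1, X2, X1 + X2).
outcomes = [(x1, x2, x1 + x2) for x1, x2 in product(range(1, 7), repeat=2)]
prob = Fraction(1, len(outcomes))   # each of the 36 outcomes is equally likely

# E(U): expectation taken component by component.
EU = [sum(prob * u[i] for u in outcomes) for i in range(3)]

# cov(U): the i,j element is E[(U_i - E(U_i)) (U_j - E(U_j))].
covU = [[sum(prob * (u[i] - EU[i]) * (u[j] - EU[j]) for u in outcomes)
         for j in range(3)] for i in range(3)]

print(EU)
for row in covU:
    print(row)
```

The diagonal of cov($U$) holds the variances 35/12, 35/12, and 35/6; the off-diagonal zero reflects the independence of the two rolls, and the matrix is symmetric.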


Exercise set C

1. Show that the 1,1 element of cov($U$) equals var($U_1$); the 2,3 element
equals cov($U_2$, $U_3$).

2. Show that cov($U$) is symmetric.

3. If $A$ is a fixed (i.e., non-random) matrix of size $n \times 3$ and $B$ is a fixed
matrix of size $1 \times m$, show that $E(AUB) = AE(U)B$.

4. Show that $\mathrm{cov}(AU) = A\,\mathrm{cov}(U)A'$.

5. If $c$ is a fixed vector of size $3 \times 1$, show that $\mathrm{var}(c'U) = c'\,\mathrm{cov}(U)c$ and
$\mathrm{cov}(U + c) = \mathrm{cov}(U)$.
Comment. If $V$ is an $n \times 1$ random vector, $C$ is a fixed $m \times n$ matrix, and $D$
is a fixed $m \times 1$ vector, then $\mathrm{cov}(CV + D) = C\,\mathrm{cov}(V)C'$.

6. What’s the difference between $\bar U = (U_1 + U_2 + U_3)/3$ and $E(U)$?

7. Suppose $\xi$ and $\zeta$ are two random vectors of size $7 \times 1$. If $\xi'\zeta = 0$, are $\xi$
and $\zeta$ independent? What about the converse: if $\xi$ and $\zeta$ are independent,
is $\xi'\zeta = 0$?

8. Suppose $\xi$ and $\zeta$ are two random variables with $E(\xi) = E(\zeta) = 0$.
Show that $\mathrm{var}(\xi) = E(\xi^2)$ and $\mathrm{cov}(\xi, \zeta) = E(\xi\zeta)$.
Notes. More generally, $\mathrm{var}(\xi) = E(\xi^2) - [E(\xi)]^2$ and
$\mathrm{cov}(\xi, \zeta) = E(\xi\zeta) - E(\xi)E(\zeta)$.

9. Suppose $\xi$ is an $n \times 1$ random vector with $E(\xi) = 0$. Show that
$\mathrm{cov}(\xi) = E(\xi\xi')$.
Notes. Generally, $\mathrm{cov}(\xi) = E(\xi\xi') - E(\xi)E(\xi')$ and $E(\xi') = [E(\xi)]'$.

10. Suppose $\xi_i$, $\zeta_i$ are random variables for $i = 1, \ldots, n$. As pairs, they are
independent and identically distributed in $i$. Let $\bar\xi = \frac{1}{n}\sum_{i=1}^n \xi_i$, and
likewise for $\zeta$. True or false, and explain:

(a) $\mathrm{cov}(\xi_i, \zeta_i)$ is the same for every $i$.
(b) $\mathrm{cov}(\xi_i, \zeta_i) = \frac{1}{n}\sum_{i=1}^n (\xi_i - \bar\xi)(\zeta_i - \bar\zeta)$.

11. The random variable $X$ has density $f$ on the line; $\sigma$ and $\mu$ are real
numbers. What is the density of $\sigma X + \mu$? of $X^2$? Reminder: if $X$ has
density $f$, then $P(X < x) = \int_{-\infty}^{x} f(u)\,du$.


3.4 Positive definite matrices 


Material in this section will be used when we discuss generalized least
squares (section 5.3). Detailed proofs are beyond our scope. An $n \times n$ orthogonal
matrix $R$ has $R'R = I_{n \times n}$. (These matrices are also said to be “unitary.”)
Necessarily, $RR' = I_{n \times n}$. Geometrically, $R$ is a rotation, which preserves
angles and distances; $R$ can reverse certain directions. A diagonal matrix
$D$ is square and vanishes off the main diagonal: e.g., $D_{11}$ and $D_{22}$ may be
non-zero but $D_{12} = D_{21} = 0$. An $n \times n$ matrix $G$ is non-negative definite if

(i) $G$ is symmetric, and

(ii) $x'Gx \ge 0$ for any $n$ vector $x$.

The matrix $G$ is positive definite if $x'Gx > 0$ for any $n$ vector $x$ except
$x = 0_{n \times 1}$. (Non-negative definite matrices are also called “positive
semi-definite.”)


THEOREM 1. The matrix $G$ is non-negative definite if and only if there
is a diagonal matrix $D$ whose elements are non-negative, and an orthogonal
matrix $R$ such that $G = RDR'$. The matrix $G$ is positive definite if and only
if the diagonal entries of $D$ are all positive.


The columns of $R$ are the eigenvectors of $G$, and the diagonal elements of
$D$ are the eigenvalues. For instance, if $c$ is the first column of $R$ and $\lambda = D_{11}$,
then $Gc = c\lambda$. (This is because $GR = RD$.) It follows from theorem 1 that
a non-negative definite $G$ has a non-negative definite square root,
$G^{1/2} = RD^{1/2}R'$, where the square root of $D$ is taken element by element. A positive
definite $G$ has a positive definite inverse, $G^{-1} = RD^{-1}R'$. (See exercises
below.) If $G$ is non-negative definite rather than positive definite, that is,
$x'Gx = 0$ for some $x \ne 0$, then $G$ is not invertible. Theorem 1 is an
elementary version of the “spectral theorem.”
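
The constructions in theorem 1 can be checked numerically in the 2×2 case. In the sketch below, the rotation angle and the diagonal entries 4 and 1 are arbitrary choices made for illustration, and the helper names are hypothetical. The code builds $G = RDR'$ and then forms the square root and the inverse by operating on the diagonal entries.

```python
import math

def matmul(A, B):
    """Matrix product for matrices stored as lists of rows."""
    return [[sum(A[i][j] * B[j][k] for j in range(len(B)))
             for k in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def quad_form(G, x):
    """x'Gx for a vector x stored as a flat list."""
    return sum(x[i] * G[i][j] * x[j]
               for i in range(len(x)) for j in range(len(x)))

theta = math.pi / 6                       # arbitrary angle for the rotation R
R = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta), math.cos(theta)]]
Rt = transpose(R)

def conjugate(diag):
    """R * diag(d1, d2) * R' for the rotation R above."""
    return matmul(matmul(R, [[diag[0], 0.0], [0.0, diag[1]]]), Rt)

G = conjugate([4.0, 1.0])         # eigenvalues 4 and 1, both positive
G_half = conjugate([2.0, 1.0])    # square roots taken element by element
G_inv = conjugate([0.25, 1.0])    # reciprocals taken element by element

first_col = [[R[0][0]], [R[1][0]]]
print(matmul(G, first_col))           # close to 4 times the first column of R
print(matmul(G_half, G_half))         # close to G
print(matmul(G, G_inv))               # close to the 2 x 2 identity
print(quad_form(G, [1.0, -2.0]) > 0)  # True
```

Because the arithmetic is in floating point, the checks hold up to rounding error rather than exactly.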


Exercise set D


1. Which of the following matrices are positive definite? non-negative
definite?

$$\begin{pmatrix} 2 & 0 \\ 0 & 1 \end{pmatrix} \qquad \begin{pmatrix} 2 & 0 \\ 0 & 0 \end{pmatrix} \qquad \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \qquad \begin{pmatrix} 0 & 0 \\ 1 & 0 \end{pmatrix}$$

Hint: work out $(u \ \ v)\begin{pmatrix} a & b \\ c & d \end{pmatrix}\begin{pmatrix} u \\ v \end{pmatrix} = (u \ \ v)\begin{pmatrix} au + bv \\ cu + dv \end{pmatrix}$.

2. Suppose $X$ is an $n \times p$ matrix with rank $p \le n$.
(a) Show that $X'X$ is $p \times p$ positive definite. Hint: if $c$ is $p \times 1$, what
is $c'X'Xc$?
(b) Show that $XX'$ is $n \times n$ non-negative definite.


For exercises 3–6, suppose $R$ is an $n \times n$ orthogonal matrix and $D$ is an $n \times n$
diagonal matrix, with $D_{ii} > 0$ for all $i$. Let $G = RDR'$. Work the exercises
directly, without appealing to theorem 1.


3. Show that $\|Rx\| = \|x\|$ for any $n \times 1$ vector $x$.

4. Show that $D$ and $G$ are positive definite.

5. Let $\sqrt{D}$ be the $n \times n$ matrix whose ijth element is $\sqrt{D_{ij}}$. Show that
$\sqrt{D}\sqrt{D} = D$. Show also that $R\sqrt{D}R'R\sqrt{D}R' = G$.

6. Let $D^{-1}$ be the matrix whose ijth element is 0 for $i \ne j$, while the iith
element is $1/D_{ii}$. Show that $D^{-1}D = I_{n \times n}$ and $RD^{-1}R'G = I_{n \times n}$.

7. Suppose $G$ is positive definite. Show that:

(a) $G$ is invertible and $G^{-1}$ is positive definite.

(b) $G$ has a positive definite square root $G^{1/2}$.

(c) $G^{-1}$ has a positive definite square root $G^{-1/2}$.

8. Let $U$ be a random $3 \times 1$ vector. Show that cov($U$) is non-negative
definite, and positive definite unless there is a $3 \times 1$ fixed (i.e., non-random)
vector $c$ such that $c'U = c'E(U)$ with probability 1. Hints. Can
you compute $\mathrm{var}(c'U)$ from cov($U$)? If that hint isn’t enough, try the
case $E(U) = 0_{3 \times 1}$. Comment: if $c'U = c'E(U)$ with probability 1,
then $U - E(U)$ concentrates in a fixed hyperplane.


3.5 The normal distribution 


This is a quick review; proofs will not be given. A random variable $X$
is $N(\mu, \sigma^2)$ if it is normally distributed with mean $\mu$ and variance $\sigma^2$. Then
the density of $X$ is

$$\frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2}\frac{(x - \mu)^2}{\sigma^2}\right], \quad\text{where } \exp(t) = e^t.$$

If $X$ is $N(\mu, \sigma^2)$, then $(X - \mu)/\sigma$ is $N(0, 1)$, i.e., $(X - \mu)/\sigma$ is standard
normal. The standard normal density is

$$\phi(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}x^2\right).$$

Random variables $X_1, \ldots, X_n$ are jointly normal if all their linear combinations
are normally distributed. If $X_1$, $X_2$ are independent normal variables,
they are jointly normal, because $a_1 X_1 + a_2 X_2$ is normally distributed for any
pair $a_1$, $a_2$ of real numbers. Later on, a couple of examples will involve
jointly normal variables, and the following theorem will be helpful. (If you
want to construct normal variables, see exercise 1 below for the method.)


THEOREM 2. The distribution of jointly normal random variables is
determined by the mean vector $\alpha$ and covariance matrix $G$; the latter must
be non-negative definite. If $G$ is positive definite, the density of the random
variables at $x$ is

$$\left(\frac{1}{2\pi}\right)^{n/2} \frac{1}{\sqrt{\det G}} \exp\left[-\frac{1}{2}(x - \alpha)'G^{-1}(x - \alpha)\right].$$

For any pair $X_1$, $X_2$ of random variables, normal or otherwise, if $X_1$ and
$X_2$ are independent then $\mathrm{cov}(X_1, X_2) = 0$. The converse is generally false,
although counter-examples may seem contrived. For normal random variables,
the converse is true: if $X_1$, $X_2$ are jointly normal and $\mathrm{cov}(X_1, X_2) = 0$, then
$X_1$ and $X_2$ are independent.


The central limit theorem. With a big sample, the probability distribution
of the sum (or average) will be close to normal. More formally, suppose
$X_1, X_2, \ldots$ are independent and identically distributed with $E(X_i) = \mu$ and
$\mathrm{var}(X_i) = \sigma^2$. Then $S_n = X_1 + X_2 + \cdots + X_n$ has expected value $n\mu$ and
variance $n\sigma^2$. To standardize, subtract the expected value and divide by the
standard error (the square root of the variance):

$$Z_n = \frac{S_n - n\mu}{\sigma\sqrt{n}}.$$

The central limit theorem says that if $n$ is large, the distribution of $Z_n$ is
close to standard normal. For example,

$$P\{|S_n - n\mu| < \sigma\sqrt{n}\} = P\{|Z_n| < 1\} \to \frac{1}{\sqrt{2\pi}}\int_{-1}^{1} \exp\left(-\frac{1}{2}x^2\right) dx \doteq 0.6827.$$


There are many extensions of the theorem. Thus, the sum of independent 
random variables with different distributions is asymptotically normal, pro- 
vided each term in the sum is only a small part of the total. There are also 
versions of the central limit theorem for random vectors. Feller (1971) has 
careful statements and proofs, as do other texts on probability. 
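
The limiting probability 0.6827 can be computed from the error function in Python's `math` module, and compared with a simulation. The sketch below uses sums of die rolls as the example; the seed, the number of rolls per sum, and the number of repetitions are arbitrary assumptions.

```python
import math
import random

# Limit in the display: P(|Z| < 1) for standard normal Z, via the error function.
limit = math.erf(1 / math.sqrt(2))      # about 0.6827

# Simulation: S_n is the sum of n die rolls, with mu = 3.5 and sigma^2 = 35/12.
rng = random.Random(1)
n, reps = 100, 10_000
mu, sigma = 3.5, math.sqrt(35 / 12)

hits = 0
for _ in range(reps):
    s = sum(rng.randint(1, 6) for _ in range(n))
    z = (s - n * mu) / (sigma * math.sqrt(n))   # standardized sum Z_n
    if abs(z) < 1:
        hits += 1
fraction_within = hits / reps

print(round(limit, 4), fraction_within)
```

The simulated fraction should be close to the limit but not equal to it: there is chance error in the simulation, and the sum of 100 rolls takes only whole-number values, so the normal approximation is slightly off for finite $n$.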


Terminology. (i) Symmetry is built into the definition of positive definite 
matrices. (ii) Orthogonal matrices have orthogonal rows, and the length of 
each row is 1. The rows are said to be “orthonormal.” Similar comments 
apply to the columns. (iii) “Multivariate normal” is a synonym for jointly 
normal. (iv) Sometimes, the phrase “jointly normal” is contracted to “nor- 
mal,” although this can be confusing. (v) “Asymptotically” means, as the 
sample size—the number of terms in the sum—gets large. 


Exercise set E 


1. Suppose $G$ is $n \times n$ non-negative definite, and $\alpha$ is $n \times 1$.

(a) Find an $n \times 1$ vector $U$ of normal random variables with mean 0
and cov($U$) = $G$. Hint: let $V$ be an $n \times 1$ vector of independent
$N(0, 1)$ variables, and let $U = G^{1/2}V$.

(b) How would you modify the construction to get $E(U) = \alpha$?

2. Suppose $R$ is an orthogonal $n \times n$ matrix. If $U$ is an $n \times 1$ vector of IID
$N(0, \sigma^2)$ variables, show that $RU$ is an $n \times 1$ vector of IID $N(0, \sigma^2)$
variables. Hint: what is $E(RU)$? cov($RU$)? (“IID” is shorthand for
“independent and identically distributed.”)


3. Suppose $\xi$ and $\zeta$ are two random variables. If $E(\xi\zeta) = E(\xi)E(\zeta)$,
are $\xi$ and $\zeta$ independent? What about the converse: if $\xi$ and $\zeta$ are
independent, is $E(\xi\zeta) = E(\xi)E(\zeta)$?

4. If $U$ and $V$ are random variables, show that $\mathrm{cov}(U, V) = \mathrm{cov}(V, U)$
and $\mathrm{var}(U + V) = \mathrm{var}(U) + \mathrm{var}(V) + 2\,\mathrm{cov}(U, V)$. Hint: what is
$[(U - \alpha) + (V - \beta)]^2$?

5. Suppose $\xi$ and $\zeta$ are jointly normal variables, with $E(\xi) = \alpha$, $\mathrm{var}(\xi) = \sigma^2$,
$E(\zeta) = \beta$, $\mathrm{var}(\zeta) = \tau^2$, and $\mathrm{cov}(\xi, \zeta) = \rho\sigma\tau$. Find the mean and
variance of $\xi + \zeta$. Is $\xi + \zeta$ normal?


Comments. Exercises 6-8 prepare for the next chapter. Exercise 6 is covered, 
for instance, by Freedman-Pisani-Purves (2007) in chapter 18. Exercises 7 
and 8 are covered in chapters 20-21. 


6. A coin is tossed 1000 times. Use the central limit theorem to approximate
the chance of getting 475-525 heads (inclusive).


7. A box has red marbles and blue marbles. The fraction p of reds is 
unknown. 250 marbles are drawn at random with replacement, and 102 
turn out to be red. Estimate p. Attach a standard error to your estimate. 


8. Let $\hat p$ be the estimator in exercise 7.
(a) About how big is the difference between $\hat p$ and $p$?
(b) Can you find an approximate 95% confidence interval for $p$?


9. The “error function” $\Phi$ is defined as follows:

$$\Phi(x) = \frac{2}{\sqrt{\pi}} \int_0^x \exp(-u^2)\,du.$$

Show that $\Phi$ is the distribution function of $|W|$, where $W$ is $N(0, \sigma^2)$.
Find $\sigma^2$. If $Z$ is $N(0, 1)$, how would you compute $P(Z < x)$ from $\Phi$?


10. If $U$, $V$ are IID $N(0, 1)$, show that $(U + V)/\sqrt{2}$, $(U - V)/\sqrt{2}$ are IID
$N(0, 1)$.


3.6 If you want a book on matrix algebra 


Blyth TS, Robertson EF (2002). Basic Linear Algebra. 2nd ed. Springer. 
Clear, mathematical. 


Strang G (2005). Linear Algebra and Its Applications. 4th ed. Brooks Cole. 
Love it or hate it. 


Meyer CD (2001). Matrix Analysis and Applied Linear Algebra. SIAM. 
More of a conventional textbook. 


Lax PD (2007). Linear Algebra and its Applications. 2nd ed. Wiley. 
Graduate-level text. 


Multiple Regression 


4.1 Introduction 


In this chapter, we set up the regression model and derive the main results 
about least squares estimators. The model is 


$$Y = X\beta + \epsilon. \tag{1}$$


On the left, $Y$ is an $n \times 1$ vector of observable random variables. The $Y$ vector
is the dependent or response variable; $Y$ is being “explained” or “modeled.”
As usual, $Y_i$ is the ith component of $Y$.

On the right hand side, $X$ is an $n \times p$ matrix of observable random
variables, called the design matrix. We assume that $n > p$, and the design
matrix has full rank, i.e., the rank of $X$ is $p$. (In other words, the columns of
$X$ are linearly independent.) Next, $\beta$ is a $p \times 1$ vector of parameters. Usually,
these are unknown, to be estimated from data. The final term on the right
is $\epsilon$, an $n \times 1$ random vector. This is the random error or disturbance term.
Generally, $\epsilon$ is not observed. We write $\epsilon_i$ for the ith component of $\epsilon$.

In applications, there is a $Y_i$ for each unit of observation $i$. Similarly,
there is one row in $X$ for each unit of observation, and one column for each
data variable. These are the explanatory or independent variables, although
seldom will any column of $X$ be statistically independent of any other column.
Orthogonality is rare too, except in designed experiments.

Columns of X are often called covariates or control variables, especially 
if they are put into the equation to control for confounding; “covariate” can 
have a more specific meaning, discussed in chapter 9. Sometimes, Y is called 
the “left hand side” variable. The columns in X are then (surprise) the “right 
hand side” variables. If the equation—like (1.1) or (2.7)—has an intercept, 
the corresponding column in the matrix is a “variable” only by courtesy: this 
column is all 1’s. 

We’ll write $X_i$ for the ith row of $X$. The matrix equation (1) unpacks
into $n$ ordinary equations, one for each unit of observation. For the ith unit,
the equation is


$$Y_i = X_i\beta + \epsilon_i. \tag{2}$$


To estimate $\beta$, we need some data, and some assumptions connecting
the data to the model. A basic assumption is that


(3) the data on $Y$ are observed values of $X\beta + \epsilon$.


We have observed values for $X$ and $Y$, not the random variables themselves.
We do not know $\beta$ and do not observe $\epsilon$. These remain at the level of concepts.
The next assumption:


(4) The $\epsilon_i$ are independent and identically distributed, with mean 0 and
variance $\sigma^2$.


Here, mean and variance apply to random variables not data; $E(\epsilon_i) = 0$, and
$\mathrm{var}(\epsilon_i) = \sigma^2$ is a parameter. Now comes another assumption:


(5) If $X$ is random, we assume $\epsilon$ is independent of $X$. In symbols, $\epsilon \perp\!\!\!\perp X$.


(Note: $\perp\!\!\!\perp$ is not the same as $\perp$.) Assumptions (3)-(4)-(5) are not easy to check,
because $\epsilon$ is not observable. By contrast, the rank of $X$ is easy to determine.

A matrix $X$ is “random” if some of the entries $X_{ij}$ are random variables
rather than constants. This is an additional complication. People often prefer 
to condition on X. Then X is fixed; expectations, variances, and covariances 
are conditional on X. 

We will estimate $\beta$ using the OLS (ordinary least squares) estimator:


$$\hat\beta = (X'X)^{-1}X'Y, \tag{6}$$


as in exercise 3B14 (shorthand for exercise 14, set B, chapter 3). This $\hat\beta$ is a
$p \times 1$ vector. Let


$$e = Y - X\hat\beta. \tag{7}$$

This is an $n \times 1$ vector of “residuals” or “errors.” Exercise 3B14 suggests the
origin of the name “least squares:” a sum of squares is being minimized. The
exercise contains enough hints to prove the following theorem.


THEOREM 1.
(i) $e \perp X$.
(ii) As a function of the $p \times 1$ vector $\gamma$, $\|Y - X\gamma\|^2$ is minimized when
$\gamma = \hat\beta$.

THEOREM 2. OLS is conditionally unbiased, that is, $E(\hat\beta \mid X) = \beta$.


Proof. To begin with, $\hat\beta = (X'X)^{-1}X'Y$: see (6). The model (1) says
that $Y = X\beta + \epsilon$, so


$\hat\beta = (X'X)^{-1}X'(X\beta + \epsilon)$
$\quad = (X'X)^{-1}X'X\beta + (X'X)^{-1}X'\epsilon$
$\quad = \beta + (X'X)^{-1}X'\epsilon$.


For the last step, $(X'X)^{-1}X'X = (X'X)^{-1}(X'X) = I_{p \times p}$ and $I_{p \times p}\beta = \beta$.
Thus,


(8) $\hat\beta = \beta + \eta$ where $\eta = (X'X)^{-1}X'\epsilon$.


Now $E(\eta|X) = E((X'X)^{-1}X'\epsilon|X) = (X'X)^{-1}X'E(\epsilon|X)$. We’ve conditioned
on X, so X is fixed (not random). Ditto for matrices that only depend
on X. They factor out of the expectation (exercise 3C3). What we’ve shown
so far is


(9) $E(\hat\beta|X) = \beta + (X'X)^{-1}X'E(\epsilon|X)$.


Next, $X \perp\!\!\!\perp \epsilon$ by assumption (5): conditioning on X does not change the
distribution of $\epsilon$. But $E(\epsilon) = 0_{n \times 1}$ by assumption (4). Thus, $E(\hat\beta|X) = \beta$,
completing the proof.


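Example 1. Hooke’s law (section 2.3, i.e., section 3 in chapter 2). Look
at equation (2.7). The parameter vector $\beta$ is $2 \times 1$:


$\beta = \begin{pmatrix} a \\ b \end{pmatrix}$.


The design matrix X is $6 \times 2$. The first column is all 1’s, to accommodate
the intercept a. The second column is the column of weights in table 2.1. In
matrix form, then, the model is $Y = X\beta + \epsilon$, where


$$Y = \begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \\ Y_4 \\ Y_5 \\ Y_6 \end{pmatrix}, \quad
X = \begin{pmatrix} 1 & 0 \\ 1 & 2 \\ 1 & 4 \\ 1 & 6 \\ 1 & 8 \\ 1 & 10 \end{pmatrix}, \quad
\beta = \begin{pmatrix} a \\ b \end{pmatrix}, \quad
\epsilon = \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \epsilon_4 \\ \epsilon_5 \\ \epsilon_6 \end{pmatrix}.$$


Let’s check the first row. Since $X_1 = (1\ \ 0)$, the first row in the matrix
equation says that $Y_1 = X_1\beta + \epsilon_1 = a + 0b + \epsilon_1 = a + \epsilon_1$. This is
equation (2.7) for i = 1. Similarly for the other rows.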

We want to compute $\hat\beta$ from (6), so data on Y are needed. That is where
the “length” column in table 2.1 comes into the picture. The model says that
the lengths of the spring under the various loads are the observed values of
$Y = X\beta + \epsilon$. These observed values are


439.00 
439.12 
439.21 
439.31 
439.40 
439.50 


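Now we can compute the OLS estimates from (6).


$$\hat\beta = (X'X)^{-1}X'Y
= \left[ \begin{pmatrix} 1 & 1 & 1 & 1 & 1 & 1 \\ 0 & 2 & 4 & 6 & 8 & 10 \end{pmatrix}
\begin{pmatrix} 1 & 0 \\ 1 & 2 \\ 1 & 4 \\ 1 & 6 \\ 1 & 8 \\ 1 & 10 \end{pmatrix} \right]^{-1}
\begin{pmatrix} 1 & 1 & 1 & 1 & 1 & 1 \\ 0 & 2 & 4 & 6 & 8 & 10 \end{pmatrix}
\begin{pmatrix} 439.00 \\ 439.12 \\ 439.21 \\ 439.31 \\ 439.40 \\ 439.50 \end{pmatrix}
= \begin{pmatrix} 439.01\ \text{cm} \\ 0.05\ \text{cm/kg} \end{pmatrix}.$$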
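The arithmetic in (6) can be checked by machine. Here is a minimal sketch in Python, using only the standard library; the function name `ols_intercept_slope` is ours, chosen for illustration, and the 2×2 inverse is written out by hand rather than taken from a matrix package.

```python
# Hooke's law data as given in the text: weights (kg) and lengths (cm).
weights = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
lengths = [439.00, 439.12, 439.21, 439.31, 439.40, 439.50]


def ols_intercept_slope(x, y):
    """Return (a_hat, b_hat) from beta_hat = (X'X)^{-1} X'Y, where X = [1, x]."""
    n = len(x)
    sx = sum(x)
    sxx = sum(v * v for v in x)
    sy = sum(y)
    sxy = sum(u * v for u, v in zip(x, y))
    # X'X = [[n, sx], [sx, sxx]] and X'Y = [sy, sxy].
    det = n * sxx - sx * sx
    if det == 0:
        raise ValueError("X'X is singular: the design matrix does not have rank 2")
    a_hat = (sxx * sy - sx * sxy) / det
    b_hat = (n * sxy - sx * sy) / det
    return a_hat, b_hat


a_hat, b_hat = ols_intercept_slope(weights, lengths)
print(round(a_hat, 2), "cm;", round(b_hat, 2), "cm/kg")
```

Rounded to two decimals, the estimates agree with the values displayed above. With more than two columns in X, you would hand the job to a linear algebra routine instead of writing out the inverse.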


Exercise set A 


1. In the regression model of section 1, one of the following is always true
and the other is usually false. Which is which, and why?

(i) $\epsilon \perp X$ (ii) $\epsilon \perp\!\!\!\perp X$

2. In the regression model of section 1, one of the following is always true
and the other is usually false. Which is which, and why?

(i) $e \perp X$ (ii) $e \perp\!\!\!\perp X$

3. Does $e \perp X$ help validate assumption (5)?

4. Suppose the first column of X is all 1’s, so the regression equation has
an intercept.

(a) Show that $\sum_i e_i = 0$.

(b) Does $\sum_i e_i = 0$ help validate assumption (4)?

(c) Is $\sum_i \epsilon_i = 0$? Or is $\sum_i \epsilon_i$ around $\sigma\sqrt{n}$ in size?

5. Show that (i) $E(\epsilon'\epsilon|X) = n\sigma^2$ and (ii) $\mathrm{cov}(\epsilon|X) = E(\epsilon\epsilon'|X) = \sigma^2 I_{n \times n}$.

6. How is column 2 in table 2.1 related to the regression model for Hooke’s
law? (Cross-references: table 2.1 is table 1 in chapter 2.)

7. Yule’s regression model (1.1) for pauperism can be translated into matrix
notation: $Y = X\beta + \epsilon$. We assume (3)-(4)-(5). For the metropolitan
unions and the period 1871-81:

(a) What are X and Y? (Hint: look at table 1.3.)

(b) What are the observed values of $X_{41}$? $X_{42}$? $Y_4$?

(c) Where do we look in $(X'X)^{-1}X'Y$ to find the estimated coefficient
of $\Delta$Out?


Note. These days, we use the computer to work out $(X'X)^{-1}X'Y$. Yule
did it with two slide rules and the “Brunsviga Arithmometer,” a pinwheel
calculating machine that could add, subtract, multiply, and divide.


4.2 Standard errors


Once we’ve computed the regression estimates, we need to see how
accurate they are. If the model is right, this is pretty easy. Standard errors
do the job. The first step is getting the covariance matrix of $\hat\beta$. Here is a
preliminary result.


(10) $\mathrm{cov}(\hat\beta|X) = (X'X)^{-1}X'\,\mathrm{cov}(\epsilon|X)\,X(X'X)^{-1}$.


To prove (10), start from (8):


$\hat\beta = \beta + (X'X)^{-1}X'\epsilon$.


Conditionally, X is fixed; so are matrices that only involve X. If you keep
in mind that $X'X$ is symmetric and $(AB)' = B'A'$, exercises 3C4-5 will
complete the argument for (10).


THEOREM 3. $\mathrm{cov}(\hat\beta|X) = \sigma^2(X'X)^{-1}$.

Proof. The proof is immediate from (10) and exercise A5.


Usually, $\sigma^2$ is unknown and has to be estimated from the data. If we
knew the $\epsilon_i$, we could estimate $\sigma^2$ as


$\frac{1}{n}\sum_{i=1}^{n} \epsilon_i^2$.


But we don’t know the $\epsilon$’s. The next thing to try might be


$\frac{1}{n}\sum_{i=1}^{n} e_i^2$.


This is a little too small. The $e_i$ are generally smaller than the $\epsilon_i$, because $\hat\beta$
was chosen to make the sum of the $e_i^2$ as small as possible. The usual fix is
to divide by the degrees of freedom $n - p$ rather than n:


(11) $\hat\sigma^2 = \frac{1}{n-p}\sum_{i=1}^{n} e_i^2$.


Now $\hat\sigma^2$ is conditionally unbiased (theorem 4 below). Equation (11) is the
reason we need $n > p$, not just $n \ge p$. If $n = p$, the estimator $\hat\sigma^2$ is undefined:
you would get 0/0. See exercise B12 below.

The proof that $\hat\sigma^2$ is unbiased is a little complicated, so let’s postpone it
for a minute and look at the bigger picture. We can estimate the parameter
vector $\beta$ in the model (1) by OLS: $\hat\beta = (X'X)^{-1}X'Y$. Conditionally on X,
this estimator is unbiased, and the covariance matrix is $\sigma^2(X'X)^{-1}$. All is
well, except that $\sigma^2$ is unknown. We just plug in $\hat\sigma^2$, which is (almost) the
mean square of the residuals: the sum of squares is divided by the degrees
of freedom $n - p$, not by n. To sum up,


(12) $\widehat{\mathrm{cov}}(\hat\beta|X) = \hat\sigma^2(X'X)^{-1}$.
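To see (11) and (12) at work, here is a sketch that runs them on the Hooke's law data from example 1. It is an illustration in plain Python, not part of the book's labs; the variable names are ours, and the 2×2 inverse is again written out by hand.

```python
import math

weights = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]            # kg
lengths = [439.00, 439.12, 439.21, 439.31, 439.40, 439.50]  # cm
n, p = len(weights), 2

sx = sum(weights)
sxx = sum(w * w for w in weights)
sy = sum(lengths)
sxy = sum(w * y for w, y in zip(weights, lengths))
det = n * sxx - sx * sx          # determinant of X'X

# (X'X)^{-1} for the two-column design matrix.
xtx_inv = [[sxx / det, -sx / det],
           [-sx / det, n / det]]

# OLS estimates, equation (6).
a_hat = xtx_inv[0][0] * sy + xtx_inv[0][1] * sxy
b_hat = xtx_inv[1][0] * sy + xtx_inv[1][1] * sxy

# Residuals (7) and the variance estimate (11): divide by n - p, not n.
residuals = [y - (a_hat + b_hat * w) for w, y in zip(weights, lengths)]
sigma2_hat = sum(e * e for e in residuals) / (n - p)

# Estimated covariance matrix (12); SEs are square roots of the diagonal.
cov_hat = [[sigma2_hat * v for v in row] for row in xtx_inv]
se_a = math.sqrt(cov_hat[0][0])
se_b = math.sqrt(cov_hat[1][1])
print("SE(intercept) =", se_a, "cm")
print("SE(slope)     =", se_b, "cm/kg")
```

The off-diagonal entry `cov_hat[0][1]` is not used here; it would be needed for the standard error of a combination of the two estimates.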


The variances are on the diagonal. Variances are the wrong size and have the
wrong units: take the square root of the variances to get the standard errors.
(What are the off-diagonal elements good for? You will need the off-diagonal
elements to compute the standard error of, e.g., $\hat\beta_2 - \hat\beta_3$. See exercise B14
below. Also see theorem 5.1, and the discussion that follows.)

Back to the mathematics. Before tackling theorem 4, we discuss the
“hat matrix,”


(13) $H = X(X'X)^{-1}X'$,

and the “predicted” or “fitted” values,

(14) $\hat Y = X\hat\beta$.


The terminology of “predicted values” can be misleading, since these are 
computed from the actual values. Nothing is being predicted. “Fitted values” 
is better. 

The hat matrix is $n \times n$, because X is $n \times p$, $X'X$ is $p \times p$, $(X'X)^{-1}$ is
$p \times p$, and $X'$ is $p \times n$. On the other hand, Y is $n \times 1$. The fitted values are
connected to the hat matrix by the equation


(15) $\hat Y = X(X'X)^{-1}X'Y = HY$.


(The equation, and the hat on Y, might explain the name “hat matrix.”) Check
these facts, with $I_{n \times n}$ abbreviated to I:


(i) $e = (I - H)Y$.
(ii) H is symmetric, and so is $I - H$.
(iii) H is idempotent ($H^2 = H$), and so is $I - H$.
(iv) X is invariant under H, that is, $HX = X$.
(v) $e = Y - HY \perp X$.

Thus, H projects Y into cols X, the column space of X. In more detail,
$HY = \hat Y = X\hat\beta \in \mathrm{cols}\,X$, and $Y - HY = e$ is orthogonal to cols X by (v).
Next,


(vi) $(I - H)X = 0$.
(vii) $(I - H)H = H(I - H) = 0$. Hint: use fact (iii).
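Some of these facts can be checked numerically before you prove them. The sketch below builds H for the Hooke's law design matrix in plain Python; the helper names (`matmul`, `transpose`, `inverse_2x2`, `max_abs_diff`) are ours, and the checks hold only up to rounding error.

```python
weights = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
X = [[1.0, w] for w in weights]      # n x p design matrix with n = 6, p = 2
n, p = len(X), len(X[0])


def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def inverse_2x2(M):
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def max_abs_diff(A, B):
    """Largest entrywise gap between two matrices of the same shape."""
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))


Xt = transpose(X)
H = matmul(matmul(X, inverse_2x2(matmul(Xt, X))), Xt)   # equation (13)
trace_H = sum(H[i][i] for i in range(n))

print("fact (ii),  H symmetric:  gap", max_abs_diff(H, transpose(H)))
print("fact (iii), H idempotent: gap", max_abs_diff(matmul(H, H), H))
print("fact (iv),  HX = X:       gap", max_abs_diff(matmul(H, X), X))
print("trace(H) =", trace_H, "with p =", p)
```

A numerical check on one design matrix is not a proof, but it is a quick way to catch a transposed factor or a misplaced inverse.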


THEOREM 4. $E(\hat\sigma^2|X) = \sigma^2$.

Proof. We claim that

(16) $e = (I - H)\epsilon$.

Indeed, by facts (i) and (vi) about the hat matrix,


(17) $e = (I - H)Y = (I - H)(X\beta + \epsilon) = (I - H)\epsilon$.

We write $\tilde H$ for $I_{n \times n} - H$, and claim that

(18) $\|e\|^2 = \epsilon'\tilde H\epsilon$.


Indeed, $\tilde H$ is symmetric and idempotent (facts (ii) and (iii) about the hat
matrix), so $\|e\|^2 = e'e = \epsilon'\tilde H^2\epsilon = \epsilon'\tilde H\epsilon$, proving (18). Check that


(19) $E(\epsilon'\tilde H\epsilon|X) = E\Bigl(\sum_{i=1}^{n}\sum_{j=1}^{n} \epsilon_i\tilde H_{ij}\epsilon_j \Bigm| X\Bigr)
= \sum_{i=1}^{n}\sum_{j=1}^{n} E(\epsilon_i\tilde H_{ij}\epsilon_j|X)
= \sum_{i=1}^{n}\sum_{j=1}^{n} \tilde H_{ij}E(\epsilon_i\epsilon_j|X)$.


The matrix $\tilde H$ is fixed, because we conditioned on X, so $\tilde H_{ij}$ factors out of
the expectation.

The next step is to simplify the double sum on the right of (19). Conditioning
on X doesn’t change the distribution of $\epsilon$, because $\epsilon \perp\!\!\!\perp X$. If $i \ne j$,
then $E(\epsilon_i\epsilon_j|X) = 0$ because $\epsilon_i$ and $\epsilon_j$ are independent with $E(\epsilon_i) = 0$. On
the other hand, $E(\epsilon_i\epsilon_i|X) = \sigma^2$. The right hand side of (19) is therefore
$\sigma^2\,\mathrm{trace}(\tilde H)$. Thus,


(20) $E(\epsilon'\tilde H\epsilon|X) = \sigma^2\sum_{i=1}^{n} \tilde H_{ii} = \sigma^2\,\mathrm{trace}(\tilde H)$.


By (18) and (20),

$E(\|e\|^2|X) = \sigma^2\,\mathrm{trace}(\tilde H)$.


Now we have to work out the trace. Remember, $H = X(X'X)^{-1}X'$ and
$\tilde H = I_{n \times n} - H$. By exercise 3B11,


$\mathrm{trace}(H) = \mathrm{trace}[(X'X)^{-1}X'X] = \mathrm{trace}(I_{p \times p}) = p$.

So $\mathrm{trace}(\tilde H) = \mathrm{trace}(I_{n \times n} - H) = \mathrm{trace}(I_{n \times n}) - \mathrm{trace}(H) = n - p$. Now

(21) $E(\|e\|^2|X) = \sigma^2(n - p)$.


To wrap things up,


$E(\hat\sigma^2|X) = \frac{1}{n-p}E(\|e\|^2|X) = \frac{1}{n-p}\,\sigma^2(n - p) = \sigma^2$,


completing the proof of theorem 4.


Things we don’t need


Theorems 1-4 show that under certain conditions, OLS is a good way to
estimate a model; also see theorem 5.1 below. There are a lot of assumptions
we don’t need to make. For instance:

• The columns of X don’t have to be orthogonal to each other.
• The random errors don’t have to be normally distributed.
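The second point can be illustrated by simulation. The sketch below is ours, under made-up settings: the Hooke's law design, hypothetical “true” parameters a = 439 and b = 0.05, and errors that are uniform rather than normal, scaled to have variance 1. If theorem 4 is right, the average of $\hat\sigma^2$ over many replications should be close to 1, while dividing by n instead of n − p should give something close to (n − p)/n.

```python
import math
import random

random.seed(0)                       # fixed seed so the run is repeatable
weights = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
n, p = len(weights), 2
a, b = 439.0, 0.05                   # hypothetical "true" parameters
half_width = math.sqrt(3.0)          # uniform on [-sqrt(3), sqrt(3)] has variance 1

sx = sum(weights)
sxx = sum(w * w for w in weights)
det = n * sxx - sx * sx


def residual_sum_of_squares(y):
    """Fit intercept and slope by OLS; return the sum of squared residuals."""
    sy = sum(y)
    sxy = sum(w * v for w, v in zip(weights, y))
    a_hat = (sxx * sy - sx * sxy) / det
    b_hat = (n * sxy - sx * sy) / det
    return sum((v - a_hat - b_hat * w) ** 2 for w, v in zip(weights, y))


reps = 20000
total = 0.0
for _ in range(reps):
    errors = [random.uniform(-half_width, half_width) for _ in weights]
    y = [a + b * w + e for w, e in zip(weights, errors)]
    total += residual_sum_of_squares(y)

mean_rss = total / reps
print("average of RSS/(n - p):", mean_rss / (n - p))   # compare with sigma^2 = 1
print("average of RSS/n:      ", mean_rss / n)         # compare with (n - p)/n
```

A simulation like this checks one design and one error distribution; the theorem covers them all.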


Exercise set B 


The first five exercises concern the regression model (1)-(5), and $X_i$ denotes
the ith row of the design matrix X.

1. True or false: $E(Y_i|X) = X_i\beta$.

2. True or false: the sample mean of the $Y_i$’s is $\bar Y = n^{-1}\sum_{i=1}^{n} Y_i$. Is $\bar Y$ a
random variable?

3. True or false: $\mathrm{var}(Y_i|X) = \sigma^2$.

4. True or false: the sample variance of the $Y_i$’s is $n^{-1}\sum_{i=1}^{n}(Y_i - \bar Y)^2$. (If
you prefer to divide by $n - 1$, that’s OK too.) Is this a random variable?

5. Conditionally on X, show that the joint distribution of the random vectors
$(\hat\beta - \beta, e)$ is the same for all values of $\beta$. Hint: express $(\hat\beta - \beta, e)$ in
terms of X and $\epsilon$.

6. Can you put standard errors on the estimated coefficients in Yule’s
equation (1.2)? Explain briefly. Hint: see exercise A7.

7. In section 2.3, we estimated the intercept and slope for Hooke’s law.
Can you put standard errors on these estimates? Explain briefly.

8. Here are two equations:

(i) $Y = X\beta + \epsilon$ (ii) $Y = X\hat\beta + e$

Which is the regression model? Which equation has the parameters and
which has the estimates? Which equation has the random errors? Which
has the residuals?

9. We use the OLS estimator $\hat\beta$ in the usual regression model, and the
unbiased estimator of variance $\hat\sigma^2$. Which of the following statements
are true, and why?


(i) $\mathrm{cov}(\hat\beta) = \sigma^2(X'X)^{-1}$.
(ii) $\mathrm{cov}(\hat\beta) = \hat\sigma^2(X'X)^{-1}$.
(iii) $\mathrm{cov}(\hat\beta|X) = \sigma^2(X'X)^{-1}$.
(iv) $\mathrm{cov}(\hat\beta|X) = \hat\sigma^2(X'X)^{-1}$.
(v) $\widehat{\mathrm{cov}}(\hat\beta|X) = \hat\sigma^2(X'X)^{-1}$.

10. True or false, and explain.

(a) If you fit a regression equation to data, the sum of the residuals is 0.
(b) If the equation has an intercept, the sum of the residuals is 0.


11. True or false, and explain.

(a) In the regression model, E(Y|X) = XB. 

(b) In the regression model, E(Y|X) = Xp. 

(c) In the regression model, E(Y|X) = XB. 
12. If X is $n \times n$ with rank n, show that $X(X'X)^{-1}X' = I_{n \times n}$, so $\hat Y = Y$.
Hint: is X invertible?

13. Suppose there is an intercept in the regression model (1), so the first
column of X is all 1’s. Let $\bar Y$ be the mean of Y. Let $\bar X$ be the mean of
X, column by column. Show that $\bar Y = \bar X\hat\beta$.

14. Let $\hat\beta$ be the OLS estimator in (1), where the design matrix X has full
rank $p < n$. Assume conditions (4) and (5).

(a) Find $\mathrm{var}(\hat\beta_1 - \hat\beta_2|X)$, where $\hat\beta_i$ is the ith component of $\hat\beta$.

(b) Suppose c is $p \times 1$. Show that $E(c'\hat\beta|X) = c'\beta$ and $\mathrm{var}(c'\hat\beta|X) = \sigma^2 c'(X'X)^{-1}c$.

15. (Hard.) Suppose $Y_i = a + bX_i + \epsilon_i$ for $i = 1, \ldots, n$, the $\epsilon_i$ being IID
with mean 0 and variance $\sigma^2$, independent of the $X_i$. (Reminder: IID
stands for “independent and identically distributed.”) Equation (2.5)
expressed $\hat a$, $\hat b$ in terms of five summary statistics: two means, two SDs,
and r. Derive the formulas for $\hat a$, $\hat b$ from equation (6) in this chapter.
Show also that, conditionally on X,


$\mathrm{SE}\,\hat a = \frac{\sigma}{\sqrt{n}}\sqrt{1 + \frac{\bar X^2}{\mathrm{var}(X)}}, \qquad \mathrm{SE}\,\hat b = \frac{\sigma}{s_X\sqrt{n}}$,


where

$\bar X = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad \mathrm{var}(X) = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar X)^2, \qquad s_X^2 = \mathrm{var}(X)$.


Hints. The design matrix M will be $n \times 2$. What is the first column? the
second? Find $M'M$. Show that $\det(M'M) = n^2\,\mathrm{var}(X)$. Find $(M'M)^{-1}$
and $M'Y$.


4.3 Explained variance in multiple regression


After fitting the regression model, we have the equation $Y = X\hat\beta + e$.
All the quantities are observable. Suppose the equation has an intercept, so
there is a column of 1’s in X. We will show in a bit that


(22) $\mathrm{var}(Y) = \mathrm{var}(X\hat\beta) + \mathrm{var}(e)$.


To define var(Y), think of Y as a data variable:

(23) $\mathrm{var}(Y) = \frac{1}{n}\sum_{i=1}^{n}(Y_i - \bar Y)^2$.


Variances on the right hand side of (22) are defined in a similar way: $\mathrm{var}(X\hat\beta)$
is called “explained variance,” and var(e) is “unexplained” or “residual”
variance. The fraction of variance “explained” by the regression is

(24) $R^2 = \mathrm{var}(X\hat\beta)/\mathrm{var}(Y)$.
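Equations (22) and (24) can be checked on the Hooke's law data from example 1. The sketch below is an illustration in plain Python with names of our choosing. It fits the line with the centered formulas for slope and intercept, which give the same answer as (6) when X has a column of 1's and one other column.

```python
weights = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
lengths = [439.00, 439.12, 439.21, 439.31, 439.40, 439.50]
n = len(weights)


def mean(v):
    return sum(v) / len(v)


def var(v):
    """Variance of a data variable, dividing by n as in equation (23)."""
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / len(v)


# OLS fit with an intercept.
xbar, ybar = mean(weights), mean(lengths)
b_hat = sum((x - xbar) * (y - ybar) for x, y in zip(weights, lengths)) / (n * var(weights))
a_hat = ybar - b_hat * xbar

fitted = [a_hat + b_hat * x for x in weights]
residuals = [y - f for y, f in zip(lengths, fitted)]

r_squared = var(fitted) / var(lengths)                  # equation (24)
print("var(Y)               =", var(lengths))
print("var(fitted) + var(e) =", var(fitted) + var(residuals))
print("R^2                  =", r_squared)
```

The first two printed numbers should agree up to rounding error; that is equation (22) for this data set.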

The proof of (22) takes some algebra. Let u be an $n \times 1$ column of
1’s, corresponding to the intercept in the regression equation. Recall that
$Y = X\hat\beta + e$. As always, $e \perp X$, so $\bar e = 0$. Now

(25) $Y - \bar Y u = X\hat\beta - \bar Y u + e$.

Since $e \perp X$ and $e \perp u$, equation (25) implies that

(26) $\|Y - \bar Y u\|^2 = \|X\hat\beta - \bar Y u\|^2 + \|e\|^2$.

Since $\bar e = 0$,

(27) $\bar Y = \overline{X\hat\beta} = \bar X\hat\beta$:

see exercise B13. Now $\|Y - \bar Y u\|^2 = n\,\mathrm{var}(Y)$ by (23); $\|X\hat\beta - \bar Y u\|^2 = n\,\mathrm{var}(X\hat\beta)$
by (27); and $\|e\|^2 = n\,\mathrm{var}(e)$ because $\bar e = 0$. From these facts
and (26),

(28) $n\,\mathrm{var}(Y) = n\,\mathrm{var}(X\hat\beta) + n\,\mathrm{var}(e)$.


Dividing both sides of (28) by n gives equation (22), as required.

[Figure: map sketch showing San Francisco, Stockton, and Sacramento.]

The math is fine, but the concept is a little peculiar. (Many people talk 
about explained variance, perhaps without sufficient consideration.) First, as 
a descriptive statistic, variance is the wrong size and has the wrong units. 
Second, well, let’s take an example. Sacramento is about 78 miles from San 
Francisco, as the crow flies. Or, the crow could fly 60 miles East and 50 miles 
North, passing near Stockton at the turn. If we take the 60 and 50 as exact, 
Pythagoras tells us that the squared hypotenuse in the triangle is 


$60^2 + 50^2 = 3600 + 2500 = 6100$ miles$^2$.


With “explained” as in “explained variance,” the geography lesson can be
cruelly summarized. The area (squared distance) between San Francisco
and Sacramento is 6100 miles$^2$, of which 3600 is explained by East.

The analogy is exact. Projecting onto East stands for (i) projecting Y
and X orthogonally to the vector u that is all 1’s, and then (ii) projecting the
remainder of Y onto what is left of the column space of X. The hypotenuse
of the triangle is $Y - \bar Y u$, with squared length $\|Y - \bar Y u\|^2 = n\,\mathrm{var}(Y)$. The
horizontal edge is $X\hat\beta - \bar Y u$, with $\|X\hat\beta - \bar Y u\|^2 = n\,\mathrm{var}(X\hat\beta)$. The vertical
edge is e, and $\|e\|^2 = n\,\mathrm{var}(e)$. The theory of explained variance boils
down to Pythagoras’ theorem on the crow’s triangular flight. Explaining the
area between San Francisco and Sacramento by East is zany, and explained
variance may not be much better.

Although “explained variance” is peculiar terminology, $R^2$ is a useful
descriptive statistic. High $R^2$ indicates a good fit between the data and the
equation: the residuals are small relative to the SD of Y. Conversely, low
$R^2$ indicates a poor fit. In fields like political science and sociology, $R^2 < 1/10$
is commonplace. This may indicate large random effects, difficulties in
measurement, and so forth. Or, there may be many important factors omitted
from the equation, which might raise questions about confounding.


Association or causation?


$R^2$ measures goodness of fit, not the validity of any underlying causal
model. For example, over the period 1950-1999, the correlation between the
purchasing power of the United States dollar each year and the death rate from
lung cancer in that year is $-0.95$. So $R^2 = (-0.95)^2 = 0.9$, which is a lot
bigger than what you find in run-of-the-mill regression studies of causation.
If you run a regression of lung cancer death rates on the purchasing power of
the dollar, the data will follow the line very closely.

Inflation, however, neither causes nor prevents lung cancer. The pur- 
chasing power of the dollar was going steadily downhill from 1950 to 1999. 
Death rates from lung cancer were generally going up (with a peak in 1990). 
These facts create a high $R^2$. Death rates from lung cancer were going up
because of increases in smoking during the first half of the century. And the 
value of the dollar was shrinking because, well, let’s not go there. 
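The same thing happens with any two series that trend steadily over time. The sketch below uses made-up numbers, not the dollar or lung cancer data: one synthetic series drifts down, the other drifts up, and the noise in each is generated independently. The slopes, noise levels, and seed are arbitrary choices for illustration.

```python
import random

random.seed(1)                       # fixed seed so the run is repeatable
years = list(range(1950, 2000))

# Two synthetic series that share nothing but a time trend.
series_down = [100.0 - 1.5 * (t - 1950) + random.gauss(0.0, 3.0) for t in years]
series_up = [20.0 + 1.0 * (t - 1950) + random.gauss(0.0, 4.0) for t in years]


def correlation(u, v):
    """Correlation coefficient between two data variables of equal length."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5


r = correlation(series_down, series_up)
print("r =", r, " R^2 =", r * r)
```

By construction neither series influences the other, yet the regression of one on the other fits closely. The trend does all the work.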


Exercise set C 


1. (Hard.) For a regression equation with an intercept, show that $R^2$ is the
square of the correlation between Y and $\hat Y$.


4.4 What happens to OLS if the assumptions break down? 

If $E(\epsilon|X) \ne 0_{n \times 1}$, the bias in the OLS estimator is $(X'X)^{-1}X'E(\epsilon|X)$,
by equation (9). If $E(\epsilon|X) = 0_{n \times 1}$ but $\mathrm{cov}(\epsilon|X) \ne \sigma^2 I_{n \times n}$, OLS will be
unbiased. However, theorem 3 breaks down: see equation (10) and section
5.3 below. Then $\hat\sigma^2(X'X)^{-1}$ may be a misleading estimator of $\mathrm{cov}(\hat\beta|X)$.


If the assumptions behind OLS are wrong, the estimator can be
severely biased. Even if the estimator is unbiased, standard errors
computed from the data can be way off. Significance levels would
not be trustworthy, for these are based on the SEs (section 5.6
below).


4.5 Discussion questions 
Some of these questions cover material from previous chapters. 


1. In the OLS regression model:
(a) Is it the residuals that are independent from one subject to another,
or the random errors?
(b) Is it the residuals that are independent of the explanatory variables,
or the random errors?
(c) Is it the vector of residuals that is orthogonal to the column space
of the design matrix, or the vector of random errors?

Explain briefly.


2. In the OLS regression model, do the residuals always have mean 0?
Discuss briefly.


3. True or false, and explain. If, after conditioning on X, the disturbance
terms in a regression equation are correlated with each other across
subjects, then:

(a) the OLS estimates are likely to be biased;

(b) the estimated standard errors are likely to be biased.


4. An OLS regression model is defined by equation (2), with assumptions
(4) and (5) on the $\epsilon$’s. Are the $Y_i$ independent? identically distributed?
Discuss briefly.


5. You are using OLS to fit a regression equation. True or false, and explain:


(a) If you exclude a variable from the equation, but the excluded
variable is orthogonal to the other variables in the equation, you won’t
bias the estimated coefficients of the remaining variables.


(b) If you exclude a variable from the equation, and the excluded
variable isn’t orthogonal to the other variables, your estimates are going
to be biased.


(c) If you put an extra variable into the equation, you won’t bias the
estimated coefficients, as long as the error term remains independent
of the explanatory variables.


(d) If you put an extra variable into the equation, you are likely to bias
the estimated coefficients, if the error term is dependent on that
extra variable.


6. True or false, and explain: as long as the design matrix has full rank, the
computer can find the OLS estimator $\hat\beta$. If so, what are the assumptions
good for? Discuss briefly.


7. Does $R^2$ measure the degree to which a regression equation fits the data?
Or does it measure the validity of the model? Discuss briefly.


8. Suppose $Y_i = au_i + bv_i + \epsilon_i$ for $i = 1, \ldots, 100$. The $\epsilon_i$ are IID with
mean 0 and variance 1. The u’s and v’s are fixed not random; these two
data variables have mean 0 and variance 1: the correlation between them
is r. If $r = \pm 1$, show that the design matrix has rank 1. Otherwise,
let $\hat a$, $\hat b$ be the OLS estimators. Find the variance of $\hat a$; of $\hat b$; of $\hat a - \hat b$.
What happens if r = 0.99? What are the implications of collinearity for
applied work? For instance, what sort of inferences about a and b are
made easier or harder by collinearity?


Comments. Collinearity sometimes means $r = \pm 1$; more often, it
means $r \approx \pm 1$. A synonym is multicollinearity. The case $r = \pm 1$ is
better called exact collinearity. Also see lab 7 at the back of the book.


9. True or false, and explain:


(a) Collinearity leads to bias in the OLS estimates.


(b) Collinearity leads to bias in the estimated standard errors for the
OLS estimates.

(c) Collinearity leads to big standard errors for some estimates.

10. Suppose $(X_i, W_i, \epsilon_i)$ are IID as triplets across subjects $i = 1, \ldots, n$,
where n is large; $E(X_i) = E(W_i) = E(\epsilon_i) = 0$, and $\epsilon_i$ is independent
of $(X_i, W_i)$. Happily, $X_i$ and $W_i$ have positive variance; they are not
perfectly correlated. The response variable $Y_i$ is in truth this:


$Y_i = aX_i + bW_i + \epsilon_i$.


We can recover a and b, up to random error, by running a regression of
$Y_i$ on $X_i$ and $W_i$. No intercept is needed. Why not? What happens if
$X_i$ and $W_i$ are perfectly correlated (as random variables)?


11. (This continues question 10.) Tom elects to run a regression of $Y_i$ on $X_i$,
omitting $W_i$. He will use the coefficient of $X_i$ to estimate a.

(a) What happens to Tom if $X_i$ and $W_i$ are independent?

(b) What happens to Tom if $X_i$ and $W_i$ are dependent?

Hint: see exercise 3B15.


12. Suppose $(X_i, \delta_i, \epsilon_i)$ are IID as triplets across subjects $i = 1, \ldots, n$,
where n is large; and $X_i$, $\delta_i$, $\epsilon_i$ are mutually independent. Furthermore,
$E(X_i) = E(\delta_i) = E(\epsilon_i) = 0$ while $E(X_i^2) = E(\delta_i^2) = 1$ and $E(\epsilon_i^2) = \sigma^2 > 0$.
The response variable $Y_i$ is in truth this:


$Y_i = aX_i + \epsilon_i$.


We can recover a, up to random error, by running a regression of $Y_i$ on
$X_i$. No intercept is needed. Why not?


13. (Continues question 12.) Let c, d, e be real numbers and let
$W_i = cX_i + d\delta_i + e\epsilon_i$. Dick elects to run a regression of $Y_i$ on $X_i$ and $W_i$,
again without an intercept. Dick will use the coefficient of $X_i$ in his
regression to estimate a. If e = 0, Dick still gets a, up to random
error, as long as $d \ne 0$. Why? And what’s wrong with d = 0?


14. (Continues questions 12 and 13.) Suppose, however, that $e \ne 0$. Then
Dick has a problem. To see the problem more clearly, assume that n is
large. Let $Q = (X\ \ W)$ be the design matrix, i.e., the first column is the
$X_i$ and the second column is the $W_i$. Show that


$$Q'Q/n \approx \begin{pmatrix} E(X_i^2) & E(X_iW_i) \\ E(X_iW_i) & E(W_i^2) \end{pmatrix}, \qquad
Q'Y/n \approx \begin{pmatrix} E(X_iY_i) \\ E(W_iY_i) \end{pmatrix}.$$


(a) Suppose a = c = d = e = 1. What will Dick estimate for the
coefficient of $X_i$ in his regression?


(b) Suppose a = c = d = 1 and e = −1. What will Dick estimate for
the coefficient of $X_i$ in his regression?


(c) A textbook on regression advises that, when in doubt, put more
explanatory variables into the equation, rather than fewer. What do
you think?


15. There is a population consisting of N subjects, with data variables x and
y. A simple regression equation can in principle be fitted by OLS to the
population data: $y_i = a + bx_i + u_i$, where $\sum_{i=1}^{N} u_i = \sum_{i=1}^{N} x_iu_i = 0$.
Although Harry does not have access to data on the full population,
he can take a sample of size $n < N$, at random with replacement:
n is moderately large, but small relative to N. He will estimate the
parameters a and b by running a regression of $y_i$ on $x_i$ for i in the
sample. He will have an intercept in the equation.


(a) Are the OLS estimates biased or unbiased? Why? (Hint: is the true
relationship linear?)

(b) Should he believe the standard errors printed out by the computer?
Discuss briefly.


16. Over the period 1950-99, the correlation between the size of the
population in the United States and the death rate from lung cancer was 0.92.
Does population density cause lung cancer? Discuss briefly.


17. (Hard.) Suppose $X_1, \ldots, X_n$ are dependent random variables. They
have a common mean, $E(X_i) = \alpha$. They have a common variance,
$\mathrm{var}(X_i) = \sigma^2$. Let $r_{ij}$ be the correlation between $X_i$ and $X_j$ for $i \ne j$.
Let


$\bar r = \frac{1}{n(n-1)}\sum_{1 \le i \ne j \le n} r_{ij}$


be the average correlation. Let $S_n = X_1 + \cdots + X_n$.


(a) Show that $\mathrm{var}(S_n) = n\sigma^2 + n(n-1)\sigma^2\bar r$.


(b) Show that $\mathrm{var}\Bigl(\frac{S_n}{n}\Bigr) = \frac{1}{n}\sigma^2 + \frac{n-1}{n}\sigma^2\bar r$.


Hint for (a):


$\Bigl[\sum_{i=1}^{n}(X_i - \alpha)\Bigr]^2 = \sum_{i=1}^{n}(X_i - \alpha)^2 + \sum_{1 \le i \ne j \le n}(X_i - \alpha)(X_j - \alpha)$.


Notes. (i) There are $n(n-1)$ pairs of indices (i, j) with $1 \le i \ne j \le n$.
(ii) If n = 100 and $\bar r = 0.05$, say, $\mathrm{var}(S_n/n)$ will be a lot bigger than
$\sigma^2/n$. Small correlations are hard to spot, so casual assumptions about
independence can be quite misleading.


18. Let $\Delta$ stand for the percentage difference from 1871 to 1881 and let
i range over the 32 metropolitan unions. Yule’s model (section 1.4)
explains $\Delta\mathrm{Paup}_i$ in terms of $\Delta\mathrm{Out}_i$, $\Delta\mathrm{Old}_i$, and $\Delta\mathrm{Pop}_i$.


(a) Is option (i) below the regression model, or the fitted equation?
What about (ii)?
(b) In (i), is b a parameter or an estimate? What about 0.755 in (ii)?
(c) In (i), is $\epsilon_i$ an observable residual or an unobservable error term?
What about $e_i$ in (ii)?

(i) $\Delta\mathrm{Paup}_i = a + b \times \Delta\mathrm{Out}_i + c \times \Delta\mathrm{Old}_i + d \times \Delta\mathrm{Pop}_i + \epsilon_i$,
the $\epsilon_i$ being IID with mean 0 independent of the explanatory variables.


(ii) $\Delta\mathrm{Paup}_i = 13.19 + 0.755\,\Delta\mathrm{Out}_i - 0.022\,\Delta\mathrm{Old}_i - 0.322\,\Delta\mathrm{Pop}_i + e_i$,
the $e_i$ having mean 0 with e orthogonal to the explanatory variables.


19. A box has N numbered tickets; N is known; the mean $\mu$ of the numbers
in the box is an unknown parameter; the variance $\sigma^2$ of the numbers in
the box is another unknown parameter. We draw n tickets at random
with replacement: $X_1$ is the first draw, $X_2$ is the second draw, ..., $X_n$
is the nth draw. Fill in the blanks, using the options below:
______ is an unbiased estimator for ______. Options:

(i) n (ii) $\sigma^2$ (iii) $E(X_1)$

(iv) $\frac{X_1 + X_2 + \cdots + X_n}{n}$

(v) None of the above


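20. (This continues question 19.) Let


$\bar X = \frac{X_1 + X_2 + \cdots + X_n}{n}$.

True or false:
(a) The $X_i$ are IID.
(b) $E(X_i) = \mu$ for all i.
(c) $E(X_i) = \bar X$ for all i.
(d) $\mathrm{var}(X_i) = \sigma^2$ for all i.
(e) $\frac{(X_1 - \bar X)^2 + (X_2 - \bar X)^2 + \cdots + (X_n - \bar X)^2}{n} = \sigma^2$.
(f) $\frac{(X_1 - \bar X)^2 + (X_2 - \bar X)^2 + \cdots + (X_n - \bar X)^2}{n} \approx \sigma^2$ if n is large.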


21. Labrie et al (2004) report on a randomized controlled experiment to
see whether routine screening for prostate cancer reduces the death rate
from that disease. The experimental subjects consisted of 46,486 men
age 45-80 who were registered to vote in Quebec City. The investigators
randomly selected 2/3 of the subjects, inviting them to annual screening.
The other 1/3 of the subjects were used as controls. Among the 7,348
men who accepted the invitation to screening, 10 deaths from prostate
cancer were observed during the first 11 years following randomization.
Among the 14,231 unscreened controls, 74 deaths from prostate cancer
were observed during the same time period. The ratio of death rates
from prostate cancer is therefore


$\frac{10/7{,}348}{74/14{,}231} = 0.26$,

i.e., screening cuts the death rate by 74%. Is this analysis convincing?
Answer yes or no, and explain briefly.


22. In the HIP trial (chapter 1), women in the treatment group who refused
screening were generally at lower risk of breast cancer. What is the
evidence for this proposition?


4.6 End notes for chapter 4 


Conditional vs unconditional expectations. The OLS estimate involves 
an inverse, $(X'X)^{-1}$. If everything is integrable, then OLS is unconditionally
unbiased. Integrability apart, conditionally unbiased is the stronger and
more useful property. In many conventional models, X’X is relatively con- 
stant when n is large. Then there is little difference between conditional and 
unconditional inference. 


Consistency and asymptotic normality. Consider the OLS estimator $\hat\beta$
in the usual model (1)-(5). One set of regularity conditions that guarantees
consistency and asymptotic normality of $\hat\beta$ is the following: p is fixed, n is
large, the elements of X are uniformly $o(\sqrt{n})$, and $X'X = nV + o(n)$ with
V a positive definite $p \times p$ matrix. Furthermore, under this set of conditions,
the F-statistic is asymptotically $\chi^2_{p_0}/p_0$ when the null hypothesis holds
(sections 5.6-7). For additional discussion, see


http://www.stat.berkeley.edu/users/census/Ftest.pdf 


Explained variance. One point was elided in section 3. If Q projects
orthogonally to the constant vectors, we must show that the projection of QY
on QX is $X\hat\beta - \bar Y$. To begin with, $QY = Y - \bar Y$ and $QX = X - \bar X$. Now
$Y - \bar Y = X\hat\beta - \bar Y + e = (X - \bar X)\hat\beta + e = (QX)\hat\beta + e$ because $\bar Y = \bar X\hat\beta$.
Plainly, $e \perp QX$, completing the argument.

The discussion questions. Questions 7 and 16 are about the interpretation
of $R^2$. Questions 8-9 are about collinearity: the general point is that some
linear combinations of the $\beta$’s will be easy to estimate, and some (the $c'\beta$
with $Xc \approx 0$) will be very hard. (Collinearity can also make results more
sensitive to omitted variables and to data entry errors.) Questions 10-15
look at assumptions in the regression model. Question 11 gives an example
of omitted-variables bias when W is correlated with X. In question 14,
if W is correlated with $\epsilon$, then including W creates endogeneity bias (also
called simultaneity bias). Question 15 is a nice test case: do the regression
assumptions hold in a sampling model? Also see discussion question 6 in
chapter 5, and


www.stat.berkeley.edu/users/census/badols.pdf

Questions 3 and 17 show that independence is the key to estimating
precision of estimates from internal evidence. (Homoscedasticity, the
assumption of constant variance, is perhaps of lesser importance.) Of course,
if the mode of dependence is known, adjustments can be made. Generally,
such things are hard to know; assumptions are easy to make. Questions 18-20
review the distinction between parameters and estimates; questions 21-22
review material on design of experiments from chapter 1.
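The point about independence can be made concrete with the formula from question 17(b) and the numbers in its note (ii). The sketch below just does that arithmetic; the function name `var_of_mean` is ours, and $\sigma^2 = 1$ is an arbitrary choice of scale.

```python
def var_of_mean(n, sigma2, avg_corr):
    """var(S_n / n) = sigma^2 / n + ((n - 1) / n) * sigma^2 * avg_corr."""
    return sigma2 / n + (n - 1) / n * sigma2 * avg_corr


n, sigma2 = 100, 1.0
independent = var_of_mean(n, sigma2, 0.0)    # reduces to sigma^2 / n
dependent = var_of_mean(n, sigma2, 0.05)     # average correlation 0.05

print("var of the mean, uncorrelated:     ", independent)
print("var of the mean, avg corr of 0.05: ", dependent)
print("ratio of standard errors:          ", (dependent / independent) ** 0.5)
```

A standard error computed as if the variables were independent would be too small by that last factor.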


Data sources. In section 3 and discussion question 16, lung cancer death 
rates are for males, age standardized to the United States population in 1970, 
from the American Cancer Society. Purchasing power of the dollar is based 
on the Consumer Price Index: Statistical Abstract of the United States, 2000, 
table 767. Total population is from Statistical Abstract of the United States, 
1994, 2000, table 2; the 1994 edition was used for the period 1950-59. 


Spurious correlations. Hendry (1980, figure 8) reports an $R^2$ of 0.998
for predicting inflation by cumulative rainfall over the period 1964-75: both
variables were increasing steadily. (The equation is quadratic, with an
adjustment for autocorrelation.) Yule (1926) reports an $R^2$ of 0.9 between English
mortality rates and the percentage of marriages performed in the Church of
England over the period 1886-1911: both variables were declining. Hans
Melberg provided the citations.

Multiple Regression: Special Topics


5.1 Introduction 


This chapter covers more specialized material, starting with an optimality
property for OLS. Generalized Least Squares will be the next topic; this
technique is mentioned in chapter 6, and used more seriously in chapters
8-9. Then comes normal theory, featuring t, $\chi^2$, and F. Finally, there is an
example to demonstrate the effect of data snooping on significance levels.


5.2 OLS is BLUE 
The OLS regression model says that


(1) $Y = X\beta + \epsilon$,


where Y is an $n \times 1$ vector of observable random variables, X is an $n \times p$
matrix of observable random variables with rank $p < n$, and $\epsilon$ is an $n \times 1$
vector of unobservable random variables, IID with mean 0 and variance $\sigma^2$,
independent of X. In this section, we’re going to drop the independence
assumptions about $\epsilon$, and make a weaker (less restrictive) set of assumptions:

(2) $E(\epsilon|X) = 0_{n \times 1}$, $\mathrm{cov}(\epsilon|X) = \sigma^2 I_{n \times n}$.


Theorems 1-4 in chapter 4 continue to hold (exercise A2 below).

The weaker assumptions will be more convenient for comparing GLS
(Generalized Least Squares) to OLS. That is the topic of the next section.
Here, we show that OLS is optimal, among linear unbiased procedures,
in the case where X is not random. Condition (2) can then be stated more
directly:


(3) $E(\epsilon) = 0_{n \times 1}$, $\mathrm{cov}(\epsilon) = \sigma^2 I_{n \times n}$.


THEOREM 1. GAUSS-MARKOV. Suppose X is fixed (i.e., not random).
Assume (1) and (3). The OLS estimator is BLUE.


The acronym BLUE stands for Best Linear Unbiased Estimator, i.e., the
one with the smallest variance. Let $\gamma = c'\beta$, where c is $p \times 1$: the parameter
$\gamma$ is a linear combination of the components of $\beta$. Examples would include
$\beta_1$, or $\beta_2 - \beta_3$. The OLS estimator for $\gamma$ is $\hat\gamma = c'\hat\beta = c'(X'X)^{-1}X'Y$. This
is unbiased by (3), and $\mathrm{var}(\hat\gamma) = \sigma^2 c'(X'X)^{-1}c$. Cf. exercise A1 below. Let
$\tilde\gamma$ be another linear unbiased estimator for $\gamma$. Then $\mathrm{var}(\tilde\gamma) \ge \mathrm{var}(\hat\gamma)$, and
$\mathrm{var}(\tilde\gamma) = \mathrm{var}(\hat\gamma)$ entails $\tilde\gamma = \hat\gamma$. That is what the theorem says.


Proof. A detailed proof is beyond our scope, but here is a sketch. Recall
that $X$ is fixed. Since $\tilde\gamma$ is by assumption a linear function of $Y$, there is an
$n \times 1$ vector $d$ with $\tilde\gamma = d'Y = d'X\beta + d'\epsilon$. Then $E(\tilde\gamma) = d'X\beta$ by (3).
Since $\tilde\gamma$ is unbiased, $d'X\beta = c'\beta$ for all $\beta$. Therefore,

(4)  $d'X = c'$.

Let $q = d - X(X'X)^{-1}c$, an $n \times 1$ vector. So

(5)  $q' = d' - c'(X'X)^{-1}X'$.

(Why is $q$ worth thinking about? Because $\tilde\gamma - \hat\gamma = q'Y$.) Multiply (5) on
the right by $X$:

(6)  $q'X = d'X - c'(X'X)^{-1}X'X = d'X - c' = 0_{1 \times p}$

by (4). From (5), $d' = q' + c'(X'X)^{-1}X'$. By exercise 3C4,

$\mathrm{var}(\tilde\gamma) = \mathrm{var}(d'\epsilon)$
$= \sigma^2 d'd$
$= \sigma^2 [q' + c'(X'X)^{-1}X'] [q + X(X'X)^{-1}c]$
$= \sigma^2 [q'q + c'(X'X)^{-1}c]$
$= \sigma^2 q'q + \mathrm{var}(\hat\gamma)$.

The cross-product terms dropped out: $q'X(X'X)^{-1}c = c'(X'X)^{-1}X'q = 0$,
because $q'X = 0_{1 \times p}$, and $X'q = 0_{p \times 1}$, by (6). Finally, $q'q = \sum_i q_i^2 \ge 0$.
The inequality is strict unless $q = 0_{n \times 1}$, i.e., $\tilde\gamma = \hat\gamma$. This completes the
proof.
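
The variance decomposition in the proof can be checked numerically. The sketch below is only an illustration, not part of the argument: the design matrix, the vector c, and the value of sigma^2 are made up, and the competing estimator is built by adding to the OLS weights a vector q chosen so that q'X = 0.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma2 = 20, 3, 2.0            # illustrative sizes and error variance
X = rng.normal(size=(n, p))          # treated as a fixed design of full rank
c = np.array([1.0, -1.0, 0.0])       # gamma = beta_1 - beta_2

XtX_inv = np.linalg.inv(X.T @ X)
d_ols = X @ XtX_inv @ c              # OLS weights: gamma_hat = d_ols' Y

# Build q with q'X = 0 by projecting a random vector off the columns of X.
z = rng.normal(size=n)
q = z - X @ (XtX_inv @ (X.T @ z))
d_alt = d_ols + q                    # weights of another linear unbiased estimator

var_ols = sigma2 * (d_ols @ d_ols)   # equals sigma^2 c'(X'X)^{-1} c
var_alt = sigma2 * (d_alt @ d_alt)

print(np.allclose(d_alt @ X, c))                          # condition (4)
print(np.isclose(var_ols, sigma2 * (c @ XtX_inv @ c)))    # variance of the OLS estimator
print(np.isclose(var_alt, var_ols + sigma2 * (q @ q)))    # decomposition from the proof
```

Any other choice of q orthogonal to the columns of X would do; the excess variance is always sigma^2 q'q.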


Exercise set A 


1. Let $\hat\beta$ be the OLS estimator in (1), where the design matrix $X$ has full
   rank $p < n$. Assume (2).
   (a) Show that $E(Y|X) = X\beta$ and $\mathrm{cov}(Y|X) = \sigma^2 I_{n \times n}$. Verify that
       $E(\hat\beta|X) = \beta$ and $\mathrm{cov}(\hat\beta|X) = \sigma^2 (X'X)^{-1}$.
   (b) Suppose $c$ is $p \times 1$. Show that $E(c'\hat\beta|X) = c'\beta$ and $\mathrm{var}(c'\hat\beta|X) =
       \sigma^2 c'(X'X)^{-1}c$.
   Hint: look at the proofs of theorems 4.2 and 4.3. Bigger hint: look at
   equations (4.8-10).
2. Verify that theorems 4.1-4 continue to hold, if we replace conditions
   (4.4-5) with condition (2) above.


5.3 Generalized least squares 


We now keep the equation $Y = X\beta + \epsilon$, but change assumption (2) to

(7)  $E(\epsilon|X) = 0_{n \times 1}$,  $\mathrm{cov}(\epsilon|X) = G$,

where $G$ is a positive definite $n \times n$ matrix. This is the GLS regression model
($X$ is assumed $n \times p$ with rank $p < n$). So the OLS estimator $\hat\beta_{OLS}$ can be
defined by (4.6) and is unbiased given $X$ by (4.9). However, the formula for
$\mathrm{cov}(\hat\beta_{OLS}|X)$ in theorem 4.3 no longer holds. Instead,

(8)  $\mathrm{cov}(\hat\beta_{OLS}|X) = (X'X)^{-1} X'GX (X'X)^{-1}$.

See (4.10), and exercise B2 below. Moreover, $\hat\beta_{OLS}$ is no longer BLUE. Some
people regard this as a fatal flaw.

The fix, if you know $G$, is to transform equation (1). You multiply on
the left by $G^{-1/2}$, getting

(9)  $(G^{-1/2}Y) = (G^{-1/2}X)\beta + (G^{-1/2}\epsilon)$.

(Why does $G^{-1/2}$ make sense? See exercise 3D7.) The transformed model
has $G^{-1/2}Y$ as the response vector, $G^{-1/2}X$ as the design matrix, and $G^{-1/2}\epsilon$
as the vector of disturbances. The parameter vector is still $\beta$. Condition (2)
holds (with $\sigma^2 = 1$) for the transformed model, by exercises 3C3 and 3C4.
That was the whole point of the transformation, and the main reason for
introducing condition (2).

The GLS estimator for $\beta$ is obtained by applying OLS to (9):

$\hat\beta_{GLS} = [(G^{-1/2}X)'(G^{-1/2}X)]^{-1} (G^{-1/2}X)' G^{-1/2}Y$.

Since $(AB)' = B'A'$ and $G^{-1/2}G^{-1/2} = G^{-1}$,

(10)  $\hat\beta_{GLS} = (X'G^{-1}X)^{-1} X'G^{-1}Y$.

Exercise B1 below shows that $X'G^{-1}X$ on the right hand side of (10) is
invertible. Furthermore, $X$ is $n \times p$, so $X'$ is $p \times n$ while $G$ and $G^{-1}$ are $n \times n$.
Thus, $X'G^{-1}X$ is $p \times p$, and $\hat\beta_{GLS}$ is $p \times 1$, as it should be. By theorem 4.2,

(11)  the GLS estimator is conditionally unbiased given $X$.

By theorem 4.3 and the tiniest bit of matrix algebra,

(12)  $\mathrm{cov}(\hat\beta_{GLS}|X) = (X'G^{-1}X)^{-1}$.

There is no $\sigma^2$ in the formula: $\sigma^2$ is built into $G$. In the case of fixed $X$, the
GLS estimator is BLUE by theorem 1.
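
Here is a small numerical sketch of the two routes to the GLS estimator: OLS on the transformed model (9), and formula (10). The matrix G, the sample size, and the parameter values are invented for illustration. The symmetric inverse square root is computed from an eigendecomposition, which is one convenient choice among several.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([1.0, 2.0])

# A hypothetical positive definite G: decaying correlations, unequal variances.
idx = np.arange(n)
scale = np.linspace(0.5, 2.0, n)
G = np.outer(scale, scale) * 0.6 ** np.abs(idx[:, None] - idx[None, :])

# Draw errors with covariance G and form Y.
Y = X @ beta + np.linalg.cholesky(G) @ rng.normal(size=n)

# Symmetric inverse square root of G from its eigendecomposition.
w, V = np.linalg.eigh(G)
G_inv_half = V @ np.diag(w ** -0.5) @ V.T

# Route 1: OLS on the transformed model (9).
Xt, Yt = G_inv_half @ X, G_inv_half @ Y
beta_transformed = np.linalg.lstsq(Xt, Yt, rcond=None)[0]

# Route 2: formula (10).
G_inv = np.linalg.inv(G)
A = X.T @ G_inv @ X
beta_gls = np.linalg.solve(A, X.T @ G_inv @ Y)

# Conditional covariances: (12) for GLS, the sandwich formula (8) for OLS.
cov_gls = np.linalg.inv(A)
XtX_inv = np.linalg.inv(X.T @ X)
cov_ols = XtX_inv @ X.T @ G @ X @ XtX_inv

print(np.allclose(beta_transformed, beta_gls))
print(np.diag(cov_gls), np.diag(cov_ols))
```

The two routes should agree up to rounding error, and the diagonal of (8) should be no smaller than the diagonal of (12).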

In applications, $G$ is usually unknown, and has to be estimated from
the data. (There are some examples in the next section showing how this is
done.) Constraints have to be imposed on $G$. Without constraints, there are
too many covariances to estimate and not enough data. The estimate $\hat G$ is
substituted for $G$ in (10), giving the feasible GLS or Aitken estimator $\hat\beta_{FGLS}$:

(13)  $\hat\beta_{FGLS} = (X'\hat G^{-1}X)^{-1} X'\hat G^{-1}Y$.

Covariances would be estimated by plugging in $\hat G$ for $G$ in (12):

(14)  $\widehat{\mathrm{cov}}(\hat\beta_{FGLS}|X) = (X'\hat G^{-1}X)^{-1}$.

Sometimes the “plug-in” covariance estimator $\widehat{\mathrm{cov}}$ is a good approximation.
But sometimes it isn’t, if there are a lot of covariances to estimate and not
enough data to do it well (chapter 8). Moreover, feasible GLS is usually
nonlinear. Therefore, $\hat\beta_{FGLS}$ is usually biased, at least by a little. Remember,

$\hat\beta_{FGLS} \ne \hat\beta_{GLS}$.
Exercise set B

1. If the $n \times p$ matrix $X$ has rank $p < n$ and $G$ is $n \times n$ positive definite,
   show that $G^{-1/2}X$ has rank $p$; show also that $X'G^{-1}X$ is $p \times p$ positive
   definite, hence invertible. Hint: see exercise 3D7.

2. Let $\hat\beta_{OLS}$ be the OLS estimator in (1), where the design matrix $X$ has
   full rank $p < n$. Assume (7), i.e., we’re in the GLS model. Show that
   $E(Y|X) = X\beta$ and $\mathrm{cov}(Y|X) = G$. Verify that $E(\hat\beta_{OLS}|X) = \beta$ and
   $\mathrm{cov}(\hat\beta_{OLS}|X) = (X'X)^{-1} X'GX (X'X)^{-1}$.

3. Let $\hat\beta_{GLS}$ be the GLS estimator in (1), where the design matrix $X$ has
   full rank $p < n$. Assume (7). Show in detail that $E(\hat\beta_{GLS}|X) = \beta$ and
   $\mathrm{cov}(\hat\beta_{GLS}|X) = (X'G^{-1}X)^{-1}$.


5.4 Examples on GLS 


We are in the GLS model (1) with assumption (7) on the errors. The first 
example is right on the boundary between GLS and FGLS. 


Example 1. Suppose $\Gamma$ is a known positive definite $n \times n$ matrix and
$G = \lambda\Gamma$, where $\lambda > 0$ is an unknown parameter. Because $\lambda$ cancels in
equations (9)-(10), the GLS estimator is $\hat\beta_{GLS} = (X'\Gamma^{-1}X)^{-1} X'\Gamma^{-1}Y$. This
is “weighted” least squares. Because $\Gamma$ is fixed, the GLS estimator is linear
and unbiased given $X$; the conditional covariance is $\lambda (X'\Gamma^{-1}X)^{-1}$. More
directly, we can compute $\hat\beta_{GLS}$ by an OLS regression of $\Gamma^{-1/2}Y$ on $\Gamma^{-1/2}X$,
after which $\lambda$ can be estimated as the mean square residual; the normalization
is by $n - p$. OLS is the special case where $\Gamma = I_{n \times n}$.
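
To make example 1 concrete, here is a sketch for the simplest case, where Gamma is diagonal, so the transformation is just a rescaling of the rows. The data are simulated, and the diagonal entries of Gamma and the value of lambda are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam = 200, 2, 3.0
X = np.column_stack([np.ones(n), rng.uniform(0, 1, size=n)])
beta = np.array([0.5, -1.0])
gamma_diag = rng.uniform(0.5, 4.0, size=n)   # known diagonal of Gamma (hypothetical)
Y = X @ beta + np.sqrt(lam * gamma_diag) * rng.normal(size=n)

# Gamma^{-1/2} is diagonal here, so the transformation rescales each row.
w = 1.0 / np.sqrt(gamma_diag)
Xw, Yw = X * w[:, None], Y * w

# OLS regression of Gamma^{-1/2} Y on Gamma^{-1/2} X.
beta_gls = np.linalg.lstsq(Xw, Yw, rcond=None)[0]

# lambda is estimated by the mean square residual, normalized by n - p.
resid = Yw - Xw @ beta_gls
lam_hat = resid @ resid / (n - p)

# Estimated conditional covariance: lam_hat * (X' Gamma^{-1} X)^{-1}.
cov_hat = lam_hat * np.linalg.inv(Xw.T @ Xw)

print(beta_gls, lam_hat)
```

Observations with large entries in Gamma get small weights, which is where the name “weighted” least squares comes from.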


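Example 2. Suppose $n$ is even, $K$ is a positive definite $2 \times 2$ matrix, and

$G = \begin{pmatrix}
K & 0_{2 \times 2} & \cdots & 0_{2 \times 2} \\
0_{2 \times 2} & K & \cdots & 0_{2 \times 2} \\
\vdots & \vdots & \ddots & \vdots \\
0_{2 \times 2} & 0_{2 \times 2} & \cdots & K
\end{pmatrix}$.

The $n \times n$ matrix $G$ has $K$ repeated along the main diagonal. Here, $K$ is
unknown, to be estimated from the data. Chapter 8 has a case study with this
sort of matrix.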

Make a first pass at the data, estimating $\beta$ by OLS. This gives $\hat\beta^{(0)}$, with
residual vector $e = Y - X\hat\beta^{(0)}$. Estimate $K$ using mean products of residuals:

$\hat K_{11} = \frac{2}{n} \sum_{j=1}^{n/2} e_{2j-1}^2$,  $\hat K_{22} = \frac{2}{n} \sum_{j=1}^{n/2} e_{2j}^2$,  $\hat K_{12} = \hat K_{21} = \frac{2}{n} \sum_{j=1}^{n/2} e_{2j-1} e_{2j}$.

(Division by $n - 2$ is also fine.) Plug $\hat K$ into the formula for $G$, and then $\hat G$
into (10) to get $\hat\beta^{(1)}$, which is a feasible GLS estimator called one-step GLS.
Now $\hat\beta^{(1)}$ depends on $\hat K$. This is feasible GLS, not real GLS.

The estimation procedure can be repeated iteratively: get residuals off
$\hat\beta^{(1)}$, use them to re-estimate $K$, use the new $\hat K$ to get a new $\hat G$. Now do
feasible GLS again. Voila: $\hat\beta^{(2)}$ is the two-step GLS estimator. People
usually keep going, until the estimator settles down. This sort of procedure
is called “iteratively reweighted” least squares.
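
The iteration in example 2 can be sketched in a few lines. Everything specific here is an assumption made for illustration: the data are simulated, the true K and the parameter values are invented, the helper names `estimate_K` and `gls` are ours, and the loop runs a fixed five steps rather than testing for convergence.

```python
import numpy as np

def estimate_K(resid):
    """Mean products of residuals over consecutive pairs (divisor n/2)."""
    pairs = resid.reshape(-1, 2)          # row j holds the j-th pair of residuals
    return pairs.T @ pairs / pairs.shape[0]

def gls(X, Y, K):
    """Formula (10) when G has the 2 x 2 block K repeated along the diagonal."""
    G_inv = np.kron(np.eye(len(Y) // 2), np.linalg.inv(K))
    return np.linalg.solve(X.T @ G_inv @ X, X.T @ G_inv @ Y)

rng = np.random.default_rng(3)
n = 400
K_true = np.array([[1.0, 0.8], [0.8, 2.0]])       # hypothetical K
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0])
# Each consecutive pair of errors has covariance matrix K_true.
eps = (rng.normal(size=(n // 2, 2)) @ np.linalg.cholesky(K_true).T).ravel()
Y = X @ beta_true + eps

beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]   # first pass: OLS
for step in range(1, 6):
    K_hat = estimate_K(Y - X @ beta_hat)          # re-estimate K from residuals
    beta_hat = gls(X, Y, K_hat)                   # one-step, two-step, ... GLS
    print(step, beta_hat)
```

In runs like this the estimates typically settle down after a step or two, but that is an empirical observation about the simulation, not a guarantee.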


Caution. Even with real GLS, the usual asymptotics may not apply. 
That is because condition (2) is not a sufficient condition for the 
central limit theorem, and (7) is even weaker. Feasible GLS adds 
another layer of complexity (chapter 8). 


Constraints. In the previous section, we said that to estimate $G$ from
data, constraints had to be imposed. That is because $G$ has $n$ variances along
the diagonal, and $n(n - 1)/2$ covariances off the diagonal: far too many
parameters to estimate from $n$ data points. What were the constraints in
example 1? Basically, $G$ had to be a scalar multiple of $\Gamma$, so there was only
one parameter in $G$ to worry about, namely, $\lambda$. Moreover, the estimated
value for $\lambda$ didn’t even come into the formula for $\hat\beta$.

What about example 2? Here, $G_{11}, G_{33}, \ldots$ are all constrained to be
equal: the common value is called $K_{11}$. Similarly, $G_{22}, G_{44}, \ldots$ are all
constrained to be equal: the common value is called $K_{22}$. Also, $G_{12}, G_{34}, \ldots$
are all constrained to be equal: the common value is called $K_{12}$. By symmetry,
$G_{21} = G_{43} = \cdots = K_{21} = K_{12}$. The remaining $G_{ij}$ are all constrained to
be 0. As a result, there are three parameters to estimate: $K_{11}$, $K_{22}$, and $K_{12}$.
(Often, there will be many more parameters.) The constraints help explain
the form of $\hat K$. For instance, $\epsilon_1, \epsilon_3, \ldots$ all have common variance $K_{11}$. The
“ideal” estimator for $K_{11}$ would be the average of $\epsilon_1^2, \epsilon_3^2, \ldots$. The $\epsilon$’s are
unobservable, so we use residuals.

Terminology. Consider the model (1), assuming only that $E(\epsilon|X) =
0_{n \times 1}$. Suppose too that the $Y_i$ are uncorrelated given $X$, i.e., $\mathrm{cov}(\epsilon|X)$ is a
diagonal matrix. In this setup, homoscedasticity means that $\mathrm{var}(Y_i|X)$ is the
same for all $i$, so that assumption (2) holds, although $\sigma^2$ may depend on
$X$. Heteroscedasticity means that $\mathrm{var}(Y_i|X)$ isn’t the same for all $i$, so that
assumption (2) fails. Then people fall back on (7) and GLS.


Exercise set C 


1. Suppose $U_i$ are IID for $i = 1, \ldots, m$ with mean $\alpha$ and variance $\sigma^2$.
   Suppose $V_i$ are IID for $i = 1, \ldots, n$ with mean $\alpha$ and variance $\tau^2$. The
   mean is the same, but variance and sample size are different. Suppose
   the $U$’s and $V$’s are independent. How would you estimate $\alpha$ if $\sigma^2$
   and $\tau^2$ are known? if $\sigma^2$ and $\tau^2$ are unknown? Hint: get this into
   the GLS framework by defining $\epsilon_j = U_j - \alpha$ for $j = 1, \ldots, m$, and
   $\epsilon_j = V_{j-m} - \alpha$ for $j = m + 1, \ldots, m + n$.


2. Suppose $Y = X\beta + \epsilon$. The design matrix $X$ is $n \times p$ with rank $p < n$, and
   $\epsilon \perp\!\!\!\perp X$. The $\epsilon_i$ are independent with $E(\epsilon_i) = 0$. However, $\mathrm{var}(\epsilon_i) = \lambda c_i$.
   The $c_i$ are known positive constants.

   (a) If $\lambda$ is known and the $c_i$ are all equal, show that the GLS estimator
       for $\beta$ is the $p \times 1$ vector $\gamma$ that minimizes

       $\sum_i (Y_i - X_i\gamma)^2$.

   (b) If $\lambda$ is known, and the $c_i$ are not all equal, show that the GLS
       estimator for $\beta$ is the $p \times 1$ vector $\gamma$ that minimizes

       $\sum_i (Y_i - X_i\gamma)^2 / \mathrm{var}(Y_i|X)$.

       Hints: In this application, what is the $i$th row of the matrix equation
       (9)? How is (9) estimated?

   (c) If $\lambda$ is unknown, show that the GLS estimator for $\beta$ is the $p \times 1$
       vector $\gamma$ that minimizes

       $\sum_i (Y_i - X_i\gamma)^2 / c_i$.


3. (Hard.) There are three observations on a variable $Y$ for each individual
   $i = 1, 2, \ldots, 800$. There is an explanatory variable $Z$, which is scalar.
   Maria thinks that each subject $i$ has a “fixed effect” $a_i$ and there is a
   parameter $b$ common to all 800 subjects. Her model can be stated this
   way:

   $Y_{ij} = a_i + Z_{ij} b + \epsilon_{ij}$ for $i = 1, 2, \ldots, 800$ and $j = 1, 2, 3$.

   She is willing to assume that the $\epsilon_{ij}$ are independent with mean 0. She
   also believes that the $\epsilon$’s are independent of the $Z$’s and $\mathrm{var}(\epsilon_{ij})$ is the
   same for $j = 1, 2, 3$. But she is afraid that $\mathrm{var}(\epsilon_{ij}) = \sigma_i^2$ depends on the
   subject $i$. Can you get this into the GLS framework? What would you
   use for the response vector $Y$ in (1)? The design matrix? (This will get
   ugly.) With her model, what can you say about $G$ in (7)? How would
   you estimate her model?
5.5 What happens to GLS if the assumptions break down?


If $E(\epsilon|X) \ne 0_{n \times 1}$, equation (10) shows the bias in the GLS estimator
is $(X'G^{-1}X)^{-1} X'G^{-1} E(\epsilon|X)$. If $E(\epsilon|X) = 0_{n \times 1}$ but $\mathrm{cov}(\epsilon|X) \ne G$, then
GLS will be unbiased but (12) breaks down. If $G$ is estimated from data but
does not satisfy the assumptions behind the estimation procedure, then (14)
may be a misleading estimator of $\mathrm{cov}(\hat\beta_{FGLS}|X)$.


5.6 Normal theory 


In this section and the next, we review the conventional theory of the
OLS model, which conditions on $X$ (an $n \times p$ matrix of full rank $p < n$)
and restricts the $\epsilon_i$ to be independent $N(0, \sigma^2)$. The principal results are the
$t$-test and the $F$-test. As usual, $e = Y - X\hat\beta$ is the vector of residuals. Fix
$k = 1, \ldots, p$. Write $\hat\beta_k$ for the $k$th component of the vector $\hat\beta$. To test the null
hypothesis that $\beta_k = 0$ against the alternative $\beta_k \ne 0$, we use the $t$-statistic:

(15)  $t = \hat\beta_k / \widehat{SE}$,

with $\widehat{SE}$ equal to $\hat\sigma$ times the square root of the $kk$th element of $(X'X)^{-1}$. We
reject the null hypothesis when $|t|$ is large, e.g., $|t| > 2$. For testing at a fixed
level, the critical value depends (to some extent) on $n - p$. When $n - p$ is
large, people refer to the $t$-test as the “$z$-test:” under the null, $t$ is close to
$N(0, 1)$. If the terminology is unfamiliar, see the definitions below.
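
The computation in (15) is short enough to write out. The sketch below assumes that the estimate of sigma^2 is the residual sum of squares divided by n - p; the function name `ols_t_statistics` and the simulated data are ours, chosen only for illustration.

```python
import numpy as np

def ols_t_statistics(X, Y):
    """Return the OLS estimates, their estimated SEs, and the t-statistics (15)."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ Y
    e = Y - X @ beta_hat                          # residual vector
    sigma_hat = np.sqrt(e @ e / (n - p))          # assumes the n - p normalization
    se = sigma_hat * np.sqrt(np.diag(XtX_inv))    # sigma_hat times sqrt of kk-th element
    return beta_hat, se, beta_hat / se

rng = np.random.default_rng(4)
n = 57
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Y = X @ np.array([1.0, 0.0, 0.5]) + rng.normal(size=n)   # second coefficient is truly 0

beta_hat, se, t = ols_t_statistics(X, Y)
for k in range(X.shape[1]):
    print(f"beta_{k + 1}: estimate {beta_hat[k]: .3f}, SE {se[k]:.3f}, t {t[k]: .2f}")
```

Inverting X'X directly is fine for a small, well-conditioned example like this one; numerical libraries usually prefer a QR or least-squares solver for the estimates themselves.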


Definitions. $U \sim N(0, 1)$, for instance, means that the random variable
$U$ is normally distributed with mean 0 and variance 1. Likewise, $U \sim W$
means that $U$ and $W$ have the same distribution. Suppose $U_1, U_2, \ldots$ are IID
$N(0, 1)$. We write $\chi_d^2$ for a variable distributed as $\sum_{i=1}^d U_i^2$ and say that $\chi_d^2$
has the chi-squared distribution with $d$ degrees of freedom. Furthermore,

$U_{d+1} \Big/ \sqrt{d^{-1} \sum_{i=1}^d U_i^2}$

has Student’s $t$-distribution with $d$ degrees of freedom.


THEOREM 2. With independent $N(0, \sigma^2)$ errors, the OLS estimator $\hat\beta$
has a normal distribution with mean $\beta$ and covariance matrix $\sigma^2 (X'X)^{-1}$.
Moreover, $e \perp\!\!\!\perp \hat\beta$ and $\|e\|^2 \sim \sigma^2 \chi_d^2$ with $d = n - p$.


COROLLARY. Under the null hypothesis, $t$ is distributed as $U / \sqrt{V/d}$,
where $U \perp\!\!\!\perp V$, $U \sim N(0, 1)$, $V \sim \chi_d^2$, and $d = n - p$. In other words,
if the null hypothesis is right, the $t$-statistic follows Student’s $t$-distribution,
with $n - p$ degrees of freedom.
Sketch proof of theorem 2. In the leading special case, $X$ vanishes except
along the main diagonal of the top $p$ rows, where $X_{ii} = 1$. For instance, if
$n = 5$ and $p = 2$,

$X = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \\ 0 & 0 \\ 0 & 0 \end{pmatrix}$.

The theorem and corollary are pretty obvious in the special case, because
$Y_i = \beta_i + \epsilon_i$ for $i \le p$ and $Y_i = \epsilon_i$ for $i > p$. Consequently, $\hat\beta$ consists of
the first $p$ elements of $Y$, $\mathrm{cov}(\hat\beta) = \sigma^2 I_{p \times p} = \sigma^2 (X'X)^{-1}$, and $e$ consists
of $p$ zeros stacked on top of the last $n - p$ elements of $Y$.

The general case is beyond our scope, but here is a sketch of the argument.
The key is finding a $p \times p$ upper triangular matrix $M$ such that the columns
of $XM$ are orthonormal. To construct $M$, regress column $j$ on the previous
$j - 1$ columns (the “Gram-Schmidt process”). The residual vector from this
regression is the part of column $j$ orthogonal to the previous columns. Since $X$
has rank $p$, column 1 cannot vanish; nor can column $j$ be a linear combination
of columns $1, \ldots, j - 1$. The orthogonal pieces can therefore be normalized
to have length 1. A bit of matrix algebra shows this set of orthonormal vectors
can be written as $XM$, where $M_{ii} \ne 0$ for all $i$ and $M_{ij} = 0$ for all $i > j$,
i.e., $M$ is upper triangular. In particular, $M$ is invertible.

Let $S$ be the special $n \times p$ matrix discussed above, with the $p \times p$ identity
matrix in the top $p$ rows and 0’s in the bottom $n - p$ rows. There is an $n \times n$
orthogonal matrix $R$ with $RXM = S$. To get $R$, take the $p \times n$ matrix $(XM)'$,
whose rows are orthonormal. Add $n - p$ rows to $(XM)'$, one row at a time, so
the resulting matrix is orthonormal. In more detail, let $Q$ be the $(n - p) \times n$
matrix consisting of the added rows, so $R$ is the “partitioned matrix” that
stacks $Q$ underneath $(XM)'$:

$R = \begin{pmatrix} (XM)' \\ Q \end{pmatrix}$.

The rows of $R$ are orthonormal by construction. So

$Q[(XM)']' = QXM = 0_{(n-p) \times p}$.

The columns of $XM$ are orthonormal, so $(XM)'XM = I_{p \times p}$. Now

$RXM = \begin{pmatrix} (XM)' \\ Q \end{pmatrix} XM = \begin{pmatrix} (XM)'XM \\ QXM \end{pmatrix} = \begin{pmatrix} I_{p \times p} \\ 0_{(n-p) \times p} \end{pmatrix} = S$,

as required.
Consider the transformed regression model $(RY) = (RXM)\gamma + \delta$,
where $\gamma = M^{-1}\beta$ and $\delta = R\epsilon$. The $\delta_i$ are IID $N(0, \sigma^2)$: see exercise
3E2. Let $\hat\gamma$ be the OLS estimates from the transformed model, and let
$f = RY - (RXM)\hat\gamma$ be the residuals. The special case of the theorem
applies to the transformed model.

You can check that $\hat\beta = M\hat\gamma$. So $\hat\beta$ is multivariate normal, as required
(section 3.5). The covariance matrix of $\hat\beta$ can be obtained from theorem 4.3.
But here is a direct argument: $\mathrm{cov}(\hat\beta) = M \mathrm{cov}(\hat\gamma) M' = \sigma^2 MM'$. We
claim that $MM' = (X'X)^{-1}$. Indeed, $RXM = S$, so $XM = R'S$. Then
$M'X'XM = S'RR'S = S'S = I_{p \times p}$. Multiply on the left by $M'^{-1}$ and on
the right by $M^{-1}$ to see that $X'X = M'^{-1}M^{-1} = (MM')^{-1}$. Invert this
equation: $(X'X)^{-1} = MM'$, as required.

For the residuals, $e = R^{-1}f$, where $f$ was the residual vector from
the transformed model. But $R^{-1} = R'$ is orthogonal, so $\|e\|^2 = \|f\|^2 \sim
\sigma^2 \chi_{n-p}^2$: cf. exercise 3D3. Independence is the last issue. In our leading
special case, $f \perp\!\!\!\perp \hat\gamma$. Thus, $R^{-1}f \perp\!\!\!\perp M\hat\gamma$, i.e., $e \perp\!\!\!\perp \hat\beta$, completing a sketch
proof of theorem 2.
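
The matrices M and R in the sketch proof can be produced numerically. The code below does not carry out Gram-Schmidt step by step; as a substitute it uses NumPy's complete QR factorization, which delivers an upper triangular M and an orthogonal R with the same properties (the diagonal of M may come out with negative entries, which the argument allows). The design matrix is random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 5, 2
X = rng.normal(size=(n, p))     # full rank with probability 1

# Complete QR factorization: X = Q_full @ T, where Q_full is n x n orthogonal
# and T is n x p with an upper triangular p x p block sitting on top of zeros.
Q_full, T = np.linalg.qr(X, mode="complete")
M = np.linalg.inv(T[:p, :])     # upper triangular; columns of X @ M are orthonormal
R = Q_full.T                    # plays the role of the orthogonal matrix R

# The special matrix S: identity on top, zeros underneath.
S = np.vstack([np.eye(p), np.zeros((n - p, p))])

print(np.allclose(R @ X @ M, S))                       # R X M = S
print(np.allclose(M @ M.T, np.linalg.inv(X.T @ X)))    # M M' = (X'X)^{-1}
print(np.allclose(R @ R.T, np.eye(n)))                 # R is orthogonal
```

All three checks should print True, up to rounding error.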


Suppose we drop the normality assumption, requiring only that the $\epsilon_i$
are independent and identically distributed with mean 0 and finite variance
$\sigma^2$. If $n$ is a lot larger than $p$, and the design matrix is not too weird, then $\hat\beta$
will be close to normal, thanks to the central limit theorem. Furthermore,
$\|e\|^2 / (n - p) \doteq \sigma^2$. The observed significance level (aka $P$-value) of the
two-sided $t$-test will be essentially the area under the normal curve beyond
$\pm \hat\beta_k / \widehat{SE}$. Without the normality assumption, however, little can be said about
the asymptotic size of $\sqrt{n} \{ [\|e\|^2 / (n - p)] - \sigma^2 \}$: this will depend on $E(\epsilon_i^4)$.


Statistical significance 


If $P < 10\%$, then $\hat\beta_k$ is statistically significant at the 10% level, or barely
significant. If $P < 5\%$, then $\hat\beta_k$ is statistically significant at the 5% level, or
statistically significant. If $P < 1\%$, then $\hat\beta_k$ is statistically significant at the
1% level, or highly significant. When $n - p$ is large, the respective cutoffs for
a two-sided $t$-test are 1.64, 1.96, and 2.58: see page 309 below. If $\hat\beta_j$ and $\hat\beta_k$
are both statistically significant, the corresponding explanatory variables are
said to have independent effects on $Y$: this has nothing to do with statistical
independence.

Statistical significance is little more than technical jargon. Over the 
years, however, the jargon has acquired enormous—and richly undeserved— 
emotional power. For additional discussion, see Freedman-Pisani-Purves 
(2007, chapter 29). 
Exercise set D

1. We have an OLS model with $p = 1$, and $X$ is a column of 1’s. Find
   $\hat\beta$ and $\hat\sigma^2$ in terms of $Y$ and $n$. If the errors are IID $N(0, \sigma^2)$, find the
   distribution of $\hat\beta - \beta$, $\hat\sigma^2$, and $\sqrt{n}(\hat\beta - \beta)/\hat\sigma$. Hint: see exercise 3B16.

2. Lei is a PhD student in sociology. She has a regression equation $Y_i =
   a + bX_i + Z_i\gamma + \epsilon_i$. Here, $X_i$ is a scalar, while $Z_i$ is a $1 \times 5$ vector
   of control variables, and $\gamma$ is a $5 \times 1$ vector of parameters. Her theory
   is that $b \ne 0$. She is willing to assume that the $\epsilon_i$ are IID $N(0, \sigma^2)$,
   independent of $X$ and $Z$. Fitting the equation to data for $i = 1, \ldots, 57$
   by OLS, she gets $\hat b = 3.79$ with $\widehat{SE} = 1.88$. True or false and explain:
   (a) For testing the null hypothesis that $b = 0$, $t \doteq 2.02$. (Reminder:
       the dotted equals sign means “about equal.”)
   (b) $\hat b$ is statistically significant.
   (c) $\hat b$ is highly significant.
   (d) The probability that $b \ne 0$ is about 95%.
   (e) The probability that $b = 0$ is about 5%.
   (f) If the model is right and $b = 0$, there is about a 5% chance of
       getting $|\hat b / \widehat{SE}| > 2$.
   (g) If the model is right and $b = 0$, there is about a 95% chance of
       getting $|\hat b / \widehat{SE}| < 2$.
   (h) Lei can be about 95% confident that $b \ne 0$.
   (i) The test shows the model is right.
   (j) The test assumes the model is right.
   (k) If the model is right, the test gives some evidence that $b \ne 0$.

3. A philosopher of science writes,

   “Suppose we toss a fair coin 10,000 times, the first 5000 tosses
   being done under a red light, and the last 5000 under a green light.
   The color of the light does not affect the coin. However, we would
   expect the statistical null hypothesis—that exactly as many heads
   will be thrown under the red light as the green light—would very
   likely not be true. There will nearly always be random fluctuations
   that make the statistical null hypothesis false.”

   Has the null hypothesis been set up correctly? Explain briefly.

4. An archeologist fits a regression model, rejecting the null hypothesis that
   $\beta_2 = 0$, with $P < 0.005$. True or false and explain:
   (a) $\beta_2$ must be large.
   (b) $\hat\beta_2$ must be large.


5.7 The F-test 


We are in the OLS model. The design matrix $X$ has full rank $p < n$.
The $\epsilon_i$ are independent $N(0, \sigma^2)$ with $\epsilon \perp\!\!\!\perp X$. We condition on $X$. Suppose
$p_0 \ge 1$ and $p_0 \le p$. We are going to test the null hypothesis that the last $p_0$
of the $\beta_i$’s are 0: that is, $\beta_i = 0$ for $i = p - p_0 + 1, \ldots, p$. The alternative
hypothesis is $\beta_i \ne 0$ for at least one $i = p - p_0 + 1, \ldots, p$. The usual test
statistic is called $F$, in honor of Sir R. A. Fisher. To define $F$, we need to fit
the full model (which includes all the columns of $X$) and a smaller model.
(A) First, we fit the full model. Let $\hat\beta$ be the OLS estimate, and $e$ the
    residual vector.
(B) Next, we fit the smaller model that satisfies the null hypothesis:
    $\beta_i = 0$ for all $i = p - p_0 + 1, \ldots, p$. Let $\hat\beta^{(s)}$ be the OLS estimate
    for the smaller model.

In effect, the smaller model just drops the last $p_0$ columns of $X$; then $\hat\beta^{(s)}$ is
a $(p - p_0) \times 1$ vector. Or, think of $\hat\beta^{(s)}$ as $p \times 1$, the last $p_0$ entries being 0.
The test statistic is

(16)  $F = \dfrac{(\|X\hat\beta\|^2 - \|X\hat\beta^{(s)}\|^2) / p_0}{\|e\|^2 / (n - p)}$.


Example 3. We have a regression model

$Y_i = a + bu_i + cv_i + dw_i + fz_i + \epsilon_i$ for $i = 1, \ldots, 72$.

(The coefficients skip from $d$ to $f$ because $e$ is used for the residual vector
in the big model.) The $u, v, w, z$ are just data, and the design matrix has
full rank. The $\epsilon_i$ are IID $N(0, \sigma^2)$. There are 72 data points and $\beta$ has 5
components:

$\beta = \begin{pmatrix} a \\ b \\ c \\ d \\ f \end{pmatrix}$.

So $n = 72$ and $p = 5$. We want to test the null hypothesis that $d = f = 0$.
So $p_0 = 2$ and $p - p_0 = 3$. The null hypothesis leaves the first 3 parameters
alone but constrains the last 2 to be 0. The small model would just drop $w$
and $z$ from the equation, leaving $Y_i = a + bu_i + cv_i + \epsilon_i$ for $i = 1, \ldots, 72$.

To say this another way, the design matrix for the big model has 5
columns. The first column is all 1’s, for the intercept. There are columns for
$u, v, w, z$. The design matrix for the small model only has 3 columns. The
first column is all 1’s. Then there are columns for $u$ and $v$. The small model
throws away the columns for $w$ and $z$. That is because the null hypothesis
says $d = f = 0$. The null hypothesis does not allow the columns for $w$ and $z$
to come into the equation. To compute $X\hat\beta^{(s)}$, use the smaller design matrix;
or, if you prefer, use the original design matrix and pad out $\hat\beta^{(s)}$ with two 0’s.
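
Here is a sketch of the computation in (16) for a setup shaped like example 3. The data are simulated (the example in the text does not supply any), the coefficient values are invented, and the helper name `fitted_and_residuals` is ours. The null hypothesis d = f = 0 is true in this simulation.

```python
import numpy as np

def fitted_and_residuals(X, Y):
    """OLS fit: return the fitted values and the residual vector."""
    beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
    fitted = X @ beta_hat
    return fitted, Y - fitted

rng = np.random.default_rng(6)
n, p, p0 = 72, 5, 2
u, v, w, z = rng.normal(size=(4, n))               # made-up explanatory variables
X_big = np.column_stack([np.ones(n), u, v, w, z])  # intercept, u, v, w, z
X_small = X_big[:, : p - p0]                       # drop the columns for w and z
Y = 1.0 + 0.5 * u - 0.3 * v + rng.normal(size=n)   # d = f = 0 by construction

fit_big, e = fitted_and_residuals(X_big, Y)
fit_small, e_small = fitted_and_residuals(X_small, Y)

numerator = (fit_big @ fit_big - fit_small @ fit_small) / p0
denominator = e @ e / (n - p)
F = numerator / denominator
print(F)
```

Using the smaller design matrix, as here, gives the same fitted values as padding the small-model estimate with two 0’s and using the original design matrix.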


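THEOREM 3. With independent $N(0, \sigma^2)$ errors, under the null hypothesis,

$\|X\hat\beta\|^2 - \|X\hat\beta^{(s)}\|^2 \sim U$,  $\|e\|^2 \sim V$,  $F \sim \dfrac{U / p_0}{V / (n - p)}$,

where $U \perp\!\!\!\perp V$, $U \sim \sigma^2 \chi_{p_0}^2$, and $V \sim \sigma^2 \chi_{n-p}^2$.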


A reminder on the notation: $p_0$ is the number of parameters that are
constrained to 0, while $\hat\beta^{(s)}$ estimates the other coefficients. The distribution
of $F$ under the null hypothesis is Fisher’s $F$-distribution, with $p_0$ degrees of
freedom in the numerator and $n - p$ in the denominator. The $\sigma^2$ cancels out.
We reject when $F$ is large, e.g., $F > 4$. For testing at a fixed level, the critical
value depends on the degrees of freedom in numerator and denominator. See
page 309 on finding critical values.

The theorem can be proved like theorem 2; details are beyond our scope.
Intuitively, if the null hypothesis is right, numerator and denominator are both
estimating $\sigma^2$, so $F$ should be around 1. The theorem applies to any $p_0$ of
the $\beta$’s; using the last $p_0$ simplifies the notation. If $p_0$ and $p$ are fixed while
$n$ gets large, and the design matrix behaves itself, the normality assumption
is not too important. If $p_0$, $p$, and $n - p$ are similar in size, normality may
be an issue. A careful (graduate-level) treatment of the $t$- and $F$-tests and
related theory will be found in Lehmann (1991ab). Also see the comments
after lab 5 at the back of the book.


“The” F-test in applied work 


In journal articles, a typical regression equation will have an intercept 
and several explanatory variables. The regression output will usually include 
an F-test, with p — 1 degrees of freedom in the numerator and n — p in 
the denominator. The null hypothesis will not be stated. The missing null 
hypothesis is that all the coefficients vanish, except for the intercept. 

If F is significant, that is often thought to validate the model. Mistake. 
The F-test takes the model as given. Significance only means this: if the 
model is right and the coefficients are 0, it was very unlikely to get such a
big F-statistic. Logically, there are three possibilities on the table. (i) An 
unlikely event occurred. (ii) Or the model is right and some of the coefficients 
differ from 0. (iii) Or the model is wrong. So? 


Exercise set E 


1. Suppose $U_i = \alpha + \delta_i$ for $i = 1, \ldots, n$. The $\delta_i$ are independent $N(0, \sigma^2)$.
   The parameters $\alpha$ and $\sigma^2$ are unknown. How would you test the null
   hypothesis that $\alpha = 0$ against the alternative that $\alpha \ne 0$?

2. Suppose $U_i$ are independent $N(\alpha, \sigma^2)$ for $i = 1, \ldots, n$. The parameters
   $\alpha$ and $\sigma^2$ are unknown. How would you test the null hypothesis that
   $\alpha = 0$ against the alternative that $\alpha \ne 0$?

3. In exercise 1, what happens if the $\delta_i$ are IID with mean 0, but are not
   normally distributed? if $n$ is small? large?

4. In Yule’s model (1.1), how would you test the null hypothesis $c = d = 0$
   against the alternative $c \ne 0$ or $d \ne 0$? Be explicit. You can use
   the metropolitan unions, 1871-81, for an example. What assumptions
   would be needed on the errors in the equation? (See lab 6 at the back of
   the book.)

5. There is another way to define the numerator of the $F$-statistic. Let $e^{(s)}$
   be the vector of residuals from the small model. Show that

   $\|X\hat\beta\|^2 - \|X\hat\beta^{(s)}\|^2 = \|e^{(s)}\|^2 - \|e\|^2$.

   Hint: what is $\|X\hat\beta^{(s)}\|^2 + \|e^{(s)}\|^2$?

6. (Hard.) George uses OLS to fit a regression equation with an intercept,
   and computes $R^2$. Georgia wants to test the null hypothesis that all the
   coefficients are 0, except for the intercept. Can she compute $F$ from $R^2$,
   $n$, and $p$? If so, what is the formula? If not, why not?


5.8 Data snooping 


The point of testing is to help distinguish between real effects and chance 
variation. People sometimes jump to the conclusion that a result which is sta- 
tistically significant cannot be explained as chance variation. However, even 
if the null hypothesis is right, there is a 5% chance of getting a “statistically
significant” result, and there is a 1% chance of getting a “highly significant” result.
An investigator who makes 100 tests can expect to get five results that are
“statistically significant” and one that is “highly significant,” even if the null
hypothesis is right in every case.

Investigators often decide which hypotheses to test only after they’ve
examined the data. Statisticians call this data snooping. To avoid being 
fooled by statistical artifacts, it would help to know how many tests were run 
before “statistically significant” differences turned up. Such information is 
seldom reported. 

Replicating studies would be even more useful, so the statistical analysis 
could be repeated on an independent batch of data. This is commonplace in 
the physical and health sciences, rare in the social sciences. An easier option 
is cross validation: you put half the data in cold storage, and look at it only 
after deciding which models to fit. This isn’t as good as real replication but 
it’s much better than nothing. Cross validation is standard in some fields, not 
in others. 

Investigators often screen out insignificant variables and refit the equa- 
tions before publishing their models. What does this data snooping do to 
P-values? 


Example 4. Suppose $Y$ consists of 100 independent random variables,
each being $N(0, 1)$. This is pure noise. The design matrix $X$ is $100 \times 50$.
All the variables are independent $N(0, 1)$. More noise. We regress $Y$ on
$X$. There won’t be much to report, although we can expect an $R^2$ of around
$50/100 = 0.5$. (This follows from theorem 3, with $n = 100$ and $p_0 = p =
50$, so $\hat\beta^{(s)} = 0_{50 \times 1}$.)

Now suppose we test each of the 50 coefficients at the 10% level, and
keep only the “significant” variables. There will be about $50 \times 0.1 = 5$ keepers.
If we just run the regression on the keepers, quietly discarding the other
variables, we are likely to get a decent $R^2$ (by social-science standards) and
dazzling $t$-statistics. One simulation, for example, gave 5 keeper columns out
of 50 starters in $X$. In the regression of $Y$ on the keepers, the $R^2$ was 0.2, and
the $t$-statistics were $-1.037$, $3.637$, $3.668$, $-3.383$, $-2.536$.


This is just one simulation. Maybe the data set was exceptional? Try
it yourself. There is one gotcha. The expected number of keepers is 5, but
the SD is over 3, so there is a lot of variability. With more keepers, the $R^2$ is
likely to be better; with fewer keepers, $R^2$ is worse. There is a small chance
of having no keepers at all, in which case, try again....


$R^2$ without an intercept. If there is no intercept in a regression equation,
$R^2$ is defined as

(17)  $R^2 = \|\hat Y\|^2 / \|Y\|^2$.
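
If you want to try example 4 yourself, here is one way to set it up. This is a sketch, not the simulation reported in the text: the seed is arbitrary, the helper name `r2_and_t` is ours, and the cutoff 1.676 is a hard-coded approximation to the two-sided 10% critical value of Student's t with 50 degrees of freedom. Your numbers will differ from those quoted above.

```python
import numpy as np

def r2_and_t(X, Y):
    """R^2 without an intercept, as in (17), together with the t-statistics."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ Y
    fitted = X @ beta_hat
    e = Y - fitted
    se = np.sqrt(e @ e / (n - p) * np.diag(XtX_inv))
    return fitted @ fitted / (Y @ Y), beta_hat / se

rng = np.random.default_rng(7)
n, p = 100, 50
Y = rng.normal(size=n)          # pure noise
X = rng.normal(size=(n, p))     # more noise

r2_full, t_full = r2_and_t(X, Y)

# Screen at the 10% level; 1.676 is an assumed, approximate critical value.
keep = np.abs(t_full) > 1.676
print("full R^2:", round(r2_full, 3), "keepers:", int(keep.sum()))

# Refit on the keepers only, if there are any.
if keep.any():
    r2_keep, t_keep = r2_and_t(X[:, keep], Y)
    print("keeper R^2:", round(r2_keep, 3), "t:", np.round(t_keep, 3))
```

Changing the seed gives a different data set each time, which is a quick way to see how variable the number of keepers is.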
Exercise set F

1. The number of keeper columns isn’t binomial. Why not?


2. In a regression equation without an intercept, show that $1 - R^2 =
   \|e\|^2 / \|Y\|^2$, where $e = Y - \hat Y$ is the vector of residuals.


5.9 Discussion questions 


Some of these questions cover material from previous chapters. 


1. Suppose $X_i$ are independent normal random variables with variance 1,
   for $i = 1, 2, 3$. The means are $\alpha + \beta$, $\alpha + 2\beta$, and $2\alpha + \beta$, respectively.
   How would you estimate the parameters $\alpha$ and $\beta$?


2. The F-test, like the t-test, assumes something in order to demonstrate 
something. What needs to be assumed, and what can be demonstrated? 
To what extent can the model itself be tested using F? Discuss briefly. 


3. Suppose $Y = X\beta + \epsilon$ where
   (i) $X$ is $n \times p$ of rank $p$, and
   (ii) $E(\epsilon|X) = \gamma$, a non-random $n \times 1$ vector, and
   (iii) $\mathrm{cov}(\epsilon|X) = G$, a non-random positive definite $n \times n$ matrix.
   Let $\hat\beta = (X'X)^{-1}X'Y$. True or false and explain:
   (a) $E(\hat\beta|X) = \beta$.
   (b) $\mathrm{cov}(\hat\beta|X) = \sigma^2 (X'X)^{-1}$.
   In (a), the exceptional case $\gamma \perp X$ should be discussed separately.


4. (This continues question 3.) Suppose $p > 1$, the first column of $X$ is all
   1’s, and $\gamma_1 = \cdots = \gamma_n$.
   (a) Is $\hat\beta_1$ biased or unbiased given $X$?
   (b) What about $\hat\beta_2$?

5. Suppose $Y = X\beta + \epsilon$ where
   (i) $X$ is fixed not random, $n \times p$ of rank $p$, and
   (ii) the $\epsilon_i$ are IID with mean 0 and variance $\sigma^2$, but
   (iii) the $\epsilon_i$ need not be normal.
   Let $\hat\beta = (X'X)^{-1}X'Y$. True or false and explain:
   (a) $E(\hat\beta) = \beta$.
   (b) $\mathrm{cov}(\hat\beta) = \sigma^2 (X'X)^{-1}$.
   (c) If $n = 100$ and $p = 6$, it is probably OK to use the $t$-test.
   (d) If $n = 100$ and $p = 96$, it is probably OK to use the $t$-test.
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6. 


10. 


Suppose that X1, X2,..., Xn, ô1, ô2,..., ôn are independent N (0, 1) 
variables, and Y; = X? — 1 + ô; However, Julia regresses Y; on Xj. 
What will she conclude about the relationship between Y; and X;? 


Suppose U and V1, ..., Vn are IID N (0, 1) variables; u is a real num- 
ber. Let X; = u +U +V. Let X = n! X; and s3? = 
(n — 17! e — XY. 

(a) What is the distribution of X;? 

(b) Do the X; have a common distribution? 

(c) Are the X; independent? 

(d) What is the distribution of X? of s?? 

(e) Is there about a 68% chance that |X — u| < s/./n? 


Suppose X; are N (u, o?) fori = 1,...,n, where n is large. We use X 
to estimate u. True or false and explain: 


(a) If the X; are independent, then X will be around u, being off by 
something like s/,/n; the chance that |X — u| < s/,/n is about 
68%. 


(b) Even if the X; are dependent, X will be around u, being off by 
something like s/,/n; the chance that |X — u| < s/,/n is about 
68%. 


What are the implications for applied work? For instance, how would 
dependence affect your ability to make statistical inferences about u? 
(Notation: X and s? were defined in question 7.) 


9. Suppose $X_i$ has mean $\mu$ and variance $\sigma^2$ for $i = 1, \ldots, n$, where n is
large. These random variables have a common distribution, which is not
normal. We use $\bar X$ to estimate $\mu$. True or false and explain:


(a) If the $X_i$ are IID, then $\bar X$ will be around $\mu$, being off by something
like $s/\sqrt{n}$; the chance that $|\bar X - \mu| < s/\sqrt{n}$ is about 68%.


(b) Even if the $X_i$ are dependent, $\bar X$ will be around $\mu$, being off by
something like $s/\sqrt{n}$; the chance that $|\bar X - \mu| < s/\sqrt{n}$ is about
68%.


What are the implications for applied work? (Notation: $\bar X$ and $s^2$ were
defined in question 7.)


10. Discussing an application like example 2 in section 4, a social scientist
says “one-step GLS is very problematic because it simply downweights
observations that do not fit the OLS model.”


(a) Does one-step GLS downweight observations that do not fit the
OLS model?

(b) Would this be a bug or a feature?
Hint: look at exercises C1-2.


11. You are thinking about a regression model $Y = X\beta + \epsilon$, with the usual
assumptions. A friend suggests adding a column Z to the design matrix. 
If you do it, the bigger design matrix still has full rank. What are the 
arguments for putting Z into the equation? Against putting it in? 


12. A random sample of size 25 is taken from a population with mean $\mu$.
The sample mean is 105.8 and the sample variance is 110. The computer
makes a t-test of the null hypothesis that $\mu = 100$. It doesn’t reject the
null. Comment briefly.


5.10 End notes for chapter 5 


BLUEness. If X is random, the OLS estimator is linear in Y but not 
X. Furthermore, the set of unbiased estimators is much larger than the set of 
conditionally unbiased estimators. Restricting to fixed X makes life easier. 
For discussion, see Shaffer (1991). There is a more elegant (although perhaps 
more opaque) matrix form of the theorem; see, e.g., 


http://www.stat.berkeley.edu/users/census/GaussMar.pdf 


Example 1. This is the textbook case of GLS, with $\lambda$ playing the role of
$\sigma^2$ in OLS. What justifies our estimator for $\lambda$? The answer is that theorem 4.4
continues to hold under condition (5.2); the proof is essentially the same. On
the other hand, without further assumptions, the normal approximation is
unlikely to hold for $\hat\beta_{\mathrm{GLS}}$: see, e.g.,


http://www.stat.berkeley.edu/users/census/cltortho.pdf 


White’s correction for heteroscedasticity. Also called the “Huber-White
correction.” It may seem natural to estimate the covariance of $\hat\beta_{\mathrm{OLS}}$ given X
as $(X'X)^{-1}X'\hat G X(X'X)^{-1}$, where $e = Y - X\hat\beta_{\mathrm{OLS}}$ is the vector of residuals
and $\hat G_{ij} = e_i e_j$: see (8). However, $e \perp X$. So $X'e = e'X = 0$ and the
proposed matrix is identically 0. On the other hand, if the $\epsilon_i$ are assumed
independent, the off-diagonal elements of $\hat G$ would be set to 0. This often
works, although $\hat G_{ii}$ can be so variable that t-statistics are surprisingly
non-t-like (see notes to chapter 8). With dependence, smoothing can be tried. A
key reference is White (1980).
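
The two versions of the estimator can be compared numerically. The sketch below is only an illustration on simulated data: the sample size, the coefficients, and the form of the heteroscedasticity are all made-up assumptions, and the variable names are ours. It checks that the full matrix with entries $e_i e_j$ gives a sandwich that is numerically 0, while the diagonal version gives a usable covariance estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3                      # illustrative sizes, not from the text
X = rng.normal(size=(n, p))
beta = np.array([1.0, -2.0, 0.5])  # made-up coefficients
# Heteroscedastic errors: the SD of epsilon_i depends on the first column of X.
eps = rng.normal(size=n) * (0.5 + np.abs(X[:, 0]))
Y = X @ beta + eps

XtX_inv = np.linalg.inv(X.T @ X)
beta_ols = XtX_inv @ X.T @ Y
e = Y - X @ beta_ols               # residuals, orthogonal to the columns of X

# Full G_hat with entries e_i * e_j: since X'e = 0, the sandwich collapses to 0.
G_full = np.outer(e, e)
naive = XtX_inv @ X.T @ G_full @ X @ XtX_inv

# Diagonal G_hat with entries e_i^2 only (off-diagonal elements set to 0).
meat = X.T @ (X * (e ** 2)[:, None])   # X' diag(e^2) X without forming diag
white = XtX_inv @ meat @ XtX_inv
robust_se = np.sqrt(np.diag(white))
```

The diagonal version is the form usually associated with White (1980); refinements that rescale the squared residuals are not shown here.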


Fixed-effects models. These are now widely used, as are “random-effects 
models” (where subjects are viewed as a random sample from some super- 
population). One example of a fixed-effects model, which illustrates the 
strengths and weaknesses of the technique, is Grogger (1995).


Asymptotics for t and F. See end notes for chapter 4, and
http://www.stat.berkeley.edu/users/census/Ftest.pdf


Data snooping. The simulation discussed in section 8 was run another 
1000 times. There were 19 runs with no keepers. Otherwise, the simulations 
gave a total of 5213 t-statistics whose distribution is shown in the histogram 
below. A little bit of data-snooping goes a long way: t-statistics with |t| > 2 
are the rule not the exception—in regressions on the keeper columns. If we 
add an intercept to the model, “the” F-test will give off-scale P-values. 


[Histogram of the t-statistics; horizontal axis runs from -8 to 8.]


Replication is the best antidote (Ehrenberg and Bound 1993), but repli- 
cation is unusual (Dewald et al 1986, Hubbard et al 1998). Many texts 
actually recommend data snooping. See, e.g., Hosmer and Lemeshow (2000, 
pp. 95ff): they suggest a preliminary screen at the 25% level, which will 
inflate $R^2$ and F even beyond our example. For an empirical demonstration
of the pitfalls, see Austin et al (2006). 


An informal argument to show that $R^2 \doteq 0.5$ in example 4. If Y is
an n vector of independent N(0, 1) variables, and we project it onto two
orthogonal linear spaces of dimensions p and q, the squared lengths of the
projections are independent $\chi^2$ variables, with p and q degrees of freedom,
respectively. Geometrically, this can be seen as follows. Choose a basis for
the first space and one for the second space. Rotate $R^n$ so the basis vectors for
the two linear spaces become unit vectors,


$u_1, \ldots, u_p$


and

$u_{p+1}, \ldots, u_{p+q}$,


where

$u_1 = (1,0,0,0,\ldots)$, $u_2 = (0,1,0,0,\ldots)$, $u_3 = (0,0,1,0,\ldots)$, ....


The distribution of Y is unchanged by rotation. The squared lengths of the
two projections are $Y_1^2 + \cdots + Y_p^2$ and $Y_{p+1}^2 + \cdots + Y_{p+q}^2$.

For the application in example 4, put n = 100 and p = q = 50. Condition
on the random design matrix X. The first linear space is cols X.
The second linear space consists of all vectors in $R^{100}$ that are orthogonal
to cols X. The same idea lurks behind the proof of theorem 2, where p = 1,
q = n - 1, and the first linear space is spanned by a column of 1’s. A similar
argument proves theorem 3. Unfortunately, details get tedious when written
out.
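
The projection argument can be checked by simulation. The sketch below assumes the setup described in this note (n = 100, p = 50, Y and X filled with independent N(0, 1) variables, no intercept) and takes $R^2$ to be the squared length of the projection of Y onto cols X divided by the squared length of Y. The number of repetitions and the seed are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 50
reps = 500                      # arbitrary number of simulated data sets
r2 = np.empty(reps)
for k in range(reps):
    X = rng.normal(size=(n, p))             # random design matrix, pure noise
    Y = rng.normal(size=n)                  # response unrelated to X
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    fitted = X @ coef                       # projection of Y onto cols X
    r2[k] = (fitted @ fitted) / (Y @ Y)     # ratio of squared lengths
mean_r2 = r2.mean()
```

Under these assumptions the average of the simulated $R^2$ values should sit close to p/(p + q) = 0.5, even though Y has nothing to do with X.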


Corrections for multiple testing. In some situations, there are proce- 
dures for controlling the “false discovery rate” due to multiple testing: see, 
e.g., Benjamini and Hochberg (1995). Other authors recommend against any 
adjustments for multiple testing, on the theory that adjustment would reduce 
power. Such authors never quite explain what the unadjusted P-value means. 
See, e.g., Rothman (1990) or Perneger (1998). 


The discussion questions. Questions 3-6 look at assumptions in the
regression model. Questions 7-9 reinforce the point that independence is the
key to estimating precision of estimates from internal evidence. Question 10
is based on Beck (2001, pp. 276-77). In question 6, the true regression is
nonlinear: $E(Y_i|X_i) = X_i^2 - 1$. Linear approximation is awful. On the
other hand, if $Y_i = X_i^3$, linear approximation is pretty good, on average. (If
you want local behavior, say at 0, linear approximation is a bad idea; it is
also bad for large x; nor should you trust the usual formulas for the SE.)

We need the moments of $X_i$ to make these ideas more precise (see below).
The regression of $X_i^3$ on $X_i$ equals $3X_i$. The correlation between $X_i^3$ and
$X_i$ is $3/\sqrt{15} \doteq 0.77$. Although the cubic is strongly nonlinear, it is well
correlated with a linear function. The moments can be used to get explicit
formulas for asymptotic bias and variance, although this takes more work.
The asymptotic variance differs from the “nominal” variance (what you get
from $X'X$). For additional detail, see


www.stat.berkeley.edu/users/census/badols.pdf 


Normal moments. Let Z be N(0, 1). The odd moments of Z vanish, by
symmetry. The even moments can be computed recursively. Integration by
parts shows that $E(Z^{2n+2}) = (2n + 1)E(Z^{2n})$. So


$E(Z^2) = 1$, $E(Z^4) = 3$, $E(Z^6) = 5 \times 3 = 15$, $E(Z^8) = 7 \times 15 = 105$, ....
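
The recursion, and the slope and correlation quoted for the cubic, can be verified with a few lines of code. This is a small sketch; the function name `even_moment` is ours.

```python
import math

def even_moment(k):
    """E(Z^(2k)) for Z ~ N(0,1), via E(Z^(2n+2)) = (2n+1) E(Z^(2n))."""
    m = 1.0                     # E(Z^0) = 1
    for n in range(k):
        m *= 2 * n + 1
    return m

# E(Z^2), E(Z^4), E(Z^6), E(Z^8)
moments = [even_moment(k) for k in range(1, 5)]

# Regression of Z^3 on Z: slope = cov(Z^3, Z) / var(Z) = E(Z^4) / E(Z^2).
slope = even_moment(2) / even_moment(1)

# corr(Z^3, Z) = E(Z^4) / sqrt(E(Z^6) * E(Z^2)), since both have mean 0.
corr = even_moment(2) / math.sqrt(even_moment(3) * even_moment(1))
```

The list `moments` reproduces 1, 3, 15, 105; the slope is 3 and the correlation is $3/\sqrt{15}$, which rounds to 0.77.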


6 


Path Models 


6.1 Stratification 


A path model is a graphical way to represent a regression equation or 
several linked regression equations. These models, developed by the geneticist
Sewall Wright, are often used to make causal inferences. We will look
at a couple of examples and then explain the logic, which involves response 
schedules and the idea of stability under interventions. 

Blau and Duncan (1967) are thinking about the stratification process 
in the United States. According to Marxist scholars of the time, the US is 
a highly stratified society. Status is determined by family background and 
transmitted through the school system. Blau and Duncan have data in their 
chapter 2, showing that family background variables do influence status— 
but the system is far from deterministic. The US has a permeable social 
structure, with many opportunities to succeed or fail. Blau and Duncan go 
on to develop the path model shown in figure 1 on the next page, in order to 
answer questions like these: 


“how and to what degree do the circumstances of birth condition sub- 
sequent status? and, how does status attained (whether by ascription 
or achievement) at one stage of the life cycle affect the prospects for a 
subsequent stage?” [p. 164]


Figure 1. Path model. Stratification, US, 1962.
[Path diagram not reproduced. Labels that survive from the figure: V = DAD’S ED,
X = DAD’S OCC, U = SON’S ED, W, and a path coefficient of .224.]


The five variables in the diagram are son’s occupation, son’s first job, 
son’s education, father’s occupation, and father’s education. Data come from 
a special supplement to the March 1962 Current Population Survey. The 
respondents are the sons (age 20-64), who answer questions about current job, 
first job, and parents. There are 20,000 respondents. Education is measured 
on a scale from 0 to 8, where 0 means no schooling, 1 means 1-4 years
of schooling, ..., 8 means some post-graduate education. Occupation is
measured on Duncan’s prestige scale from 0 to 96. The scale takes into 
account income, education, and raters’ opinions of job prestige. Hucksters 
and peddlers are near the bottom of the pyramid, with clergy in the middle 
and judges at the top. 

The path diagram uses standardized variables. Before running regressions,
you subtract the mean from each data variable, and divide by the standard
deviation. After standardization, means are 0 and variances are 1; furthermore,
variables pretty much fall in the range from -3 to 3. Table 1 shows
the correlation matrix for the data. 

Table 1. Correlation matrix for variables in Blau and Duncan’s path model.

                     Y          W              U         X          V
                     Son’s occ  Son’s 1st job  Son’s ed  Dad’s occ  Dad’s ed

Y  Son’s occ         1.000       .541           .596      .405       .322
W  Son’s 1st job      .541      1.000           .538      .417       .332
U  Son’s ed           .596       .538          1.000      .438       .453
X  Dad’s occ          .405       .417           .438     1.000       .516
V  Dad’s ed           .322       .332           .453      .516      1.000


How is figure 1 to be read? The diagram unpacks to three regression
equations:


(1)  $U = aV + bX + \delta$,

(2)  $W = cU + dX + \epsilon$,

(3)  $Y = eU + fX + gW + \eta$.


Equations are estimated by least squares. No intercepts are needed because 
the variables are standardized. (See exercise C6 for the reasoning on the 
intercepts; statistical assumptions will be discussed in section 5 below.) 

In figure 1, the arrow from V to U indicates a causal link, and V is entered
by Blau and Duncan on the right hand side of the regression equation (1) that
explains U. The path coefficient 0.310 next to the arrow is the estimated
coefficient $\hat a$ of V. The number 0.859 on the “free arrow” (that points into
U from outside the diagram) is the estimated standard deviation of the error
term $\delta$ in (1). The free arrow itself represents $\delta$.

The other arrows in figure 1 are interpreted in a similar way. There 
are three equations because three variables in the diagram (U, W, Y) have 
arrows pointing into them. The curved line joining V and X is meant to 
indicate association rather than causation: V and X influence each other, or 
are influenced by some common causes not represented in the diagram. The 
number on the curved line is just the correlation between V and X (table 1). 

The Census Bureau (which conducts the Current Population Survey used 
by Blau and Duncan) would not release raw data, due to confidentiality con- 
cerns. The Bureau did provide the correlation matrix in table 1. As it turns 
out, the correlations are all that is needed to fit the standardized equations. 
We illustrate the process on equation (1), which can be rewritten in matrix 
form as 


(4)  $U = M \begin{pmatrix} a \\ b \end{pmatrix} + \delta$,

where U and $\delta$ are $n \times 1$ vectors, while M is the $n \times 2$ “partitioned matrix”


$M = (V \;\; X)$.

In other words, the design matrix has one row for each subject, one column
for the variable V, and a second column for X. Initially, father’s education is
in the range from 0 to 8. After it is standardized to have mean 0 and variance 1
across respondents, V winds up (with rare exceptions) in the range from -3
to 3. Similarly, father’s occupation starts in the range from 0 to 96, but X
winds up between -3 and 3. Algebraically, the standardization implies


(5)  $\frac{1}{n}\sum_{i=1}^n V_i = 0, \qquad \frac{1}{n}\sum_{i=1}^n V_i^2 = 1.$


Similarly for X and U. In particular,

(6)  $r_{VX} = \frac{1}{n}\sum_{i=1}^n V_i X_i$


is the data-level correlation between V and X, computed across respondents
$i = 1, \ldots, n$. See equation (2.4).
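
Equations (5) and (6) are easy to confirm on any data set. The sketch below uses made-up data (the raw Blau and Duncan data are not available, as noted above): `ed` and `occ` are hypothetical stand-ins for father’s education on the 0 to 8 scale and father’s occupation on the 0 to 96 scale. Standardization here divides by the SD computed with divisor n.

```python
import numpy as np

def standardize(x):
    # Subtract the mean, divide by the SD (np.std uses divisor n by default).
    return (x - x.mean()) / x.std()

rng = np.random.default_rng(4)
# Hypothetical data, for illustration only.
ed = rng.integers(0, 9, size=1000).astype(float)                  # codes 0..8
occ = np.clip(10 * ed + rng.normal(scale=20, size=1000), 0, 96)   # scores 0..96

V, X = standardize(ed), standardize(occ)
mean_V = V.mean()               # should be 0, as in (5)
mean_sq_V = np.mean(V ** 2)     # should be 1, as in (5)
r_VX = np.mean(V * X)           # equation (6)
```

The value of `r_VX` agrees with the ordinary correlation coefficient between `ed` and `occ`, which does not depend on whether the SDs use divisor n or n - 1.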

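To summarize the notation, the sample size n is about 20,000. Next,
$V_i$ is the education of the ith respondent’s father, standardized. And $X_i$ is
the father’s occupation, scored on Duncan’s prestige scale from 0 to 96, then
standardized. So,


$M'M = \begin{pmatrix} \sum_{i=1}^n V_i^2 & \sum_{i=1}^n V_i X_i \\ \sum_{i=1}^n V_i X_i & \sum_{i=1}^n X_i^2 \end{pmatrix} = n \begin{pmatrix} 1 & r_{VX} \\ r_{VX} & 1 \end{pmatrix} = n \begin{pmatrix} 1.000 & 0.516 \\ 0.516 & 1.000 \end{pmatrix}.$

(You can find the 0.516 in table 1.) Similarly,

$M'U = \begin{pmatrix} \sum_{i=1}^n V_i U_i \\ \sum_{i=1}^n X_i U_i \end{pmatrix} = n \begin{pmatrix} r_{VU} \\ r_{XU} \end{pmatrix} = n \begin{pmatrix} 0.453 \\ 0.438 \end{pmatrix}.$

Now we can use equation (4.6) to get the OLS estimates:


$\begin{pmatrix} \hat a \\ \hat b \end{pmatrix} = (M'M)^{-1} M'U = \begin{pmatrix} 0.309 \\ 0.278 \end{pmatrix}.$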


These differ in the 3rd decimal place from path coefficients in figure 1, prob- 
ably due to rounding. 
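
The calculation for equation (1) can be reproduced from the three correlations alone, because the factor n cancels in $(M'M)^{-1}M'U$. A minimal sketch, using the values in table 1; the variable names are ours.

```python
import numpy as np

# Correlations read from table 1.
r_VX, r_VU, r_XU = 0.516, 0.453, 0.438

# M'M / n and M'U / n; the common factor n cancels in (M'M)^{-1} M'U.
MtM = np.array([[1.0, r_VX],
                [r_VX, 1.0]])
MtU = np.array([r_VU, r_XU])

# Solve the normal equations rather than inverting explicitly.
a_hat, b_hat = np.linalg.solve(MtM, MtU)
```

Rounded to three decimals, this gives 0.309 and 0.278, matching the text.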

What about the numbers on the free arrows? The residual variance in a
regression equation (the mean square of the residuals) is used to estimate
the variance of the disturbance term. Let $\hat\sigma^2$ be the residual variance in (1).
We’re going to derive an equation that can be solved for $\hat\sigma^2$. As a first step,
let $\hat\delta$ be the residuals after fitting (1) by OLS. Then


(7)  $1 = \frac{1}{n}\sum_{i=1}^n U_i^2$   because U is standardized

     $= \frac{1}{n}\sum_{i=1}^n (\hat a V_i + \hat b X_i + \hat\delta_i)^2$

     $= \hat a^2 \frac{1}{n}\sum_{i=1}^n V_i^2 + \hat b^2 \frac{1}{n}\sum_{i=1}^n X_i^2 + 2\hat a\hat b \frac{1}{n}\sum_{i=1}^n V_i X_i + \frac{1}{n}\sum_{i=1}^n \hat\delta_i^2.$


Two cross-product terms were dropped in (7). This is legitimate because the
residuals are orthogonal to the design matrix, so


$2\hat a \frac{1}{n}\sum_{i=1}^n V_i \hat\delta_i = 2\hat b \frac{1}{n}\sum_{i=1}^n X_i \hat\delta_i = 0.$


Because V and X were standardized,

$\frac{1}{n}\sum_{i=1}^n V_i^2 = 1, \qquad \frac{1}{n}\sum_{i=1}^n X_i^2 = 1, \qquad \frac{1}{n}\sum_{i=1}^n V_i X_i = r_{VX}.$


Substitute back into (7). Since $\hat\sigma^2$ is the mean square of the residuals $\hat\delta$,

(8)  $1 = \hat a^2 + \hat b^2 + 2\hat a\hat b\, r_{VX} + \hat\sigma^2.$


Equation (8) can be solved for $\hat\sigma^2$. Take the square root to get the SD.
The SDs are shown on the free arrows in figure 1. With a small sample,
this isn’t such a good way to estimate $\sigma^2$, because it doesn’t take degrees of
freedom into account. The fix would be to multiply $\hat\sigma^2$ by n/(n - p). When
n = 20,000 and p = 3 or 4, this is not an issue. If n were a lot smaller, in
standardized equations like (1) and (2) with two variables, the best choice for
p is 3. Behind the scenes, there is an intercept being estimated. That is the
third parameter. In an equation like (3), with three variables, take p = 4. The
sample size n cancels when computing the path coefficients, but is needed
for standard errors.
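
Equation (8) can be turned into a short calculation for the free arrow pointing into U. The sketch below recomputes the two path coefficients from the correlations in table 1, solves (8) for the residual variance, and then applies the n/(n - p) adjustment with n = 20,000 and p = 3 as discussed above. The variable names are ours.

```python
import math

# Correlations from table 1.
r_VX, r_VU, r_XU = 0.516, 0.453, 0.438

# Path coefficients for equation (1), from the 2 x 2 normal equations.
det = 1.0 - r_VX ** 2
a_hat = (r_VU - r_VX * r_XU) / det
b_hat = (r_XU - r_VX * r_VU) / det

# Solve equation (8) for the residual variance, then take the square root.
sigma2_hat = 1.0 - (a_hat ** 2 + b_hat ** 2 + 2 * a_hat * b_hat * r_VX)
sigma_hat = math.sqrt(sigma2_hat)

# Degrees-of-freedom adjustment: multiply the variance by n / (n - p).
n, p = 20_000, 3
sigma_hat_adj = math.sqrt(sigma2_hat * n / (n - p))
```

The unadjusted SD rounds to 0.859, the number on the free arrow into U. With n this large, the adjustment changes the SD only in the fifth decimal place.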

The large SDs in figure 1 show the permeability of the social structure.
(Since variables are standardized, the SDs cannot exceed 1, by exercise 4
below, so 0.753 is a big number.) Even if we know your family background
and your education and your first job, the variation in the social status of your
current job is 75% of the variation in the full sample. Variation is measured
by SD not variance: variance is on the wrong scale.

The big SDs are a good answer to the Marxist argument, and so is 
the data analysis in Blau and Duncan (1967, chapter 2). As social physics, 
however, figure 1 leaves something to be desired. Why linearity? Why the 
same coefficients for everybody? What about variables like intelligence or 
motivation? And where are the mothers?? 

Now let’s return to standardization. Standardizing might be sensible if 
(i) units are meaningful only in comparative terms (e.g., prestige points), or 
(ii) the meaning of units changes over time (e.g., years of education) while 
correlations are stable. 

If the object is to find laws of nature that are stable under intervention, 
standardizing may be a bad idea, because estimated parameters would depend 
on irrelevant details of the study design (section 2 below). Generally, the 
intervention idea gets muddier with standardization. It will be difficult to hold 
the standard deviations constant when individual values are manipulated. If 
the SDs change too, what is supposed to be invariant and why? (Manipulation 
means an intervention, as in an experiment, to set a variable at the value chosen 
by the investigator: there is no connotation of unfairness.) 

For descriptive statistics, with only one data set at issue, standardizing is 
really a matter of taste: do you like pounds, kilograms, or standard units? All 
variables are similar in scale after standardization, which may make it easier 
to compare regression coefficients. That could be why social scientists like 
to standardize. 

The terminology is peculiar. “Standardized regression coefficients” are 
just coefficients that come from fitting the equation to standardized vari- 
ables. Similarly, “unstandardized regression coefficients” come from fitting 
the equation to the “raw”—unstandardized—variables. It is not coefficients 
that get standardized, but variables. 


Exercise set A


1. Fit the equations in figure 1; find the SDs. (Cf. lab 8, back of book.) 


2. Is a in equation (1) a parameter or an estimate? 0.322 in table 1? 0.310 
in figure 1? How is 0.753 in figure 1 related to equation (3)? 


3. True or false, and explain: after fitting equation (1), the mean square of
the residuals equals their variance.


4. Prove that the SDs in a path diagram cannot exceed 1, if variables are
standardized.


5. When considering what figure 1 says about permeability of the social 
system, should we measure variation in status by the SD, or variance? 


6. In figure 1, why is there no arrow from V to W or V to Y? In principle, 
could there be an arrow from Y to U? 


7. What are some important variables omitted from equation (3)? 


8. The education variable in figure 1 takes values 0, 1, ..., 8. Does that 
have any implications for linearity in (1)? What if the education vari- 
able only took values 0, 1, 2, 3, 4? If the education variable only took 
values 0 and 1? 


6.2 Hooke’s law revisited 


According to Hooke’s law (section 2.3), if weight x is hung on a spring,
and x is not too large, the length of the spring is $a + bx + \epsilon$. (Near the elastic
limit of the spring, the physics will be more complicated.) In this equation, a
and b are physical constants that depend on the spring, not the weights. The
parameter a is the length of the spring with no load. The parameter b is the
length added to the spring by each additional unit of weight. The $\epsilon$ is random
measurement error, with the usual assumptions.

If we were to standardize, the crucial slope parameter would depend on
the weights and on the accuracy of the device used to measure the length
of the spring. To see this, let v > 0 be the variance of the weights used
in the experiment. Let $\sigma^2$ be the variance of $\epsilon$. Let $s^2$ be the mean square
of the residuals (normalized by n, not n - p). The standardized regression
coefficient is


(9)  $\hat b \sqrt{\frac{v}{\hat b^2 v + s^2}} \doteq b \sqrt{\frac{v}{b^2 v + \sigma^2}}$,


by exercise 2 below. The dotted equals sign means “approximately equal.”

The standardized regression coefficient tells us about a parameter (the
right hand side of (9)) that depends on v and $\sigma^2$. But v and $\sigma^2$ are features
of the measurement procedure, not the spring. The parameter we want to
estimate is b, which tells us how the spring responds when the load is manipulated.
The unstandardized $\hat b$ works like a charm; the standardized $\hat b$ could
be misleading. More generally, if a regression coefficient is stable under
interventions, standardizing is not a good idea: stability will get lost in the
shuffle. That is what (9) shows.
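
A small simulation makes the point concrete. The sketch below is hypothetical: the spring constants a = 439 and b = 0.05, the error SD of 0.05, and the two sets of weights are illustrative choices of ours, not data. The same spring is measured with a narrow range of weights and with a wide range; the unstandardized slope stays near b, while the standardized slope computed from the left hand side of (9) changes with v.

```python
import numpy as np

def fit(weights, a=439.0, b=0.05, sigma=0.05, seed=0):
    """Simulate lengths for the given weights; return (slope, standardized slope)."""
    rng = np.random.default_rng(seed)
    length = a + b * weights + rng.normal(scale=sigma, size=weights.size)
    x = weights - weights.mean()
    b_hat = (x @ (length - length.mean())) / (x @ x)   # OLS slope
    v = weights.var()                                  # variance of weights, divisor n
    resid = length - length.mean() - b_hat * x
    s2 = np.mean(resid ** 2)                           # mean square of residuals, divisor n
    std_slope = b_hat * np.sqrt(v / (b_hat ** 2 * v + s2))   # left side of (9)
    return b_hat, std_slope

narrow = np.tile(np.arange(4.0, 7.0), 200)    # weights 4, 5, 6 repeated
wide = np.tile(np.arange(0.0, 11.0), 200)     # weights 0, 1, ..., 10 repeated
b_narrow, std_narrow = fit(narrow)
b_wide, std_wide = fit(wide)
```

Under these assumptions the right hand side of (9) is about 0.63 for the narrow design and about 0.95 for the wide design, although the spring is the same in both.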


Standardize coefficients only if there is a good reason to do so.


Exercise set B


1. Is v in equation (9) the variance of a data variable, or a random variable?
What about $\sigma^2$?


2. Check that the left hand side of (9) is the standardized slope. Hint: work 
out the correlation coefficient between the weights and the lengths. 


3. What happens to (9) if $\sigma^2 = 0$? What would that tell us about springs
and weights?


6.3 Political repression during the McCarthy era 


Gibson (1988), reprinted at the back of the book, is about the causes of 
McCarthyism in the United States—the great witch-hunt for Reds in public 
life, particularly in Hollywood and the State Department. With the opening 
of Soviet archives, it became pretty clear there had been many agents of 
influence in the US, but McCarthy probably did more harm than all of them 
put together. 

Was repression due to the masses or the elites? Gibson argues that elite 
intolerance is the root cause. His chief piece of empirical evidence is the 
path diagram in figure 2, redrawn from the paper. The unit of analysis is the 
state. The dependent variable is a measure of repressive legislation in each 
state (table 1 in the paper, and note 4). The independent variables are mean 
tolerance scores for each state, derived from the “Stouffer survey of masses 
and elites” (table A1 in the paper, and note 8). The “masses” are just ordinary
people who turn up in a probability sample of the population. “Elites” include
school board presidents, commanders of the American Legion, bar association
presidents, and trade union officials, drawn from lists of community leaders
in medium-size cities (Stouffer 1955, pp. 17-19).


Figure 2. Path model. The causes of McCarthyism. The free arrow
pointing into Repression is not shown.

Data on masses were available for 36 states; on elites, for 26 states. 
Gibson computes correlations from the available data, then estimates a stan- 
dardized regression equation. He says, 


“Generally, it seems that elites, not masses, were responsible for the
repression of the era.... The beta for mass opinion is -.06; for elite
opinion, it is -.35 (significant beyond .01).”


His equation for legislative scores is

(10)  Repression = $\beta_1$ Mass tolerance + $\beta_2$ Elite tolerance + $\delta$.


Variables are standardized. The two straight arrows in figure 2 represent
causal links: mass and elite tolerance affect repression. The estimated coefficients
are $\hat\beta_1 = -0.06$ and $\hat\beta_2 = -0.35$. The curved line in figure 2 represents
an association between mass and elite tolerance scores. Each one can influence
the other, or both can have some common cause. The association is not
analyzed in the diagram.

Gibson is looking at an interesting qualitative question: was it the masses 
or the elites who were responsible for McCarthyism? To address this issue by 
regression, he has to quantify everything—tolerance, repression, the causal 
effects, and statistical significance. The quantification is problematic. More- 
over, as social physics, the path model is weak. Too many crucial issues are 
left dangling. What intervention is contemplated? Are there other variables 
in the system? Why are relationships linear? Signs apart, for example, why 
does a unit increase in tolerance have the same effect on repression as a unit 
decrease? Why are coefficients the same for all states? Why are states sta- 
tistically independent? Such questions are not addressed in the paper. (The 
paper is not unique in this respect.) 

McCarthy became a force in national politics with a speech attacking 
the State Department in 1950. The turning point came in 1954, with public 
humiliation in the Army-McCarthy hearings. Censure by the Senate followed
in December 1954. Gibson scores repressive legislation over the period 1945-65, long
before McCarthy mattered, and long after (note 4 in the paper). The Stouffer 
survey was done in 1954, when the McCarthy era was ending. The timetable 
does not hang together. 

Even if all such issues are set aside, and we allow Gibson the statistical
assumptions, there is a big problem. Gibson finds that $\hat\beta_2$ is significant and
$\hat\beta_1$ is insignificant. But this does not impose much of a constraint on the
difference $\hat\beta_2 - \hat\beta_1$. The standard error for the difference can be computed
from data in the paper (exercise 4 below). The difference is not significant.
Since $\beta_2 = \beta_1$ is a viable null hypothesis, the data are not strong enough to
distinguish elites from masses.

The fitting procedure is also worth some attention. Gibson used GLS
rather than OLS because he “could not assume that the variances of the observations
were equal”; instead, he “weighted the observations by the square
root of the numbers of respondents within the state” (note 9 in the paper).
This confuses the variance of $Y_i$ with the variance of $X_i$. When observations
are independent, but $\mathrm{var}(Y_i|X)$ differs from one i to another, $\hat\beta$ should be
chosen (exercise 5C2) to minimize


$\sum_i (Y_i - X_i\hat\beta)^2 / \mathrm{var}(Y_i|X).$


Gibson’s $Y_i$ is the repression score. The variance of $Y_i$ has nothing to do
with the Stouffer survey. Therefore, weighting the regression by the number
of respondents in the Stouffer survey makes little sense. The number of
respondents affects the variance of $X_i$, not the variance of $Y_i$.
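
For reference, here is what the weighted criterion looks like in code. This is a sketch on simulated data, not a reanalysis of Gibson’s paper: the design, the coefficients, and the variances `var_y` (taken as known) are all hypothetical. Dividing each row of the data by the SD of $Y_i$ and then running OLS minimizes the weighted sum of squares displayed above.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept plus one covariate
beta = np.array([1.0, 2.0])                             # made-up true coefficients
var_y = rng.uniform(0.2, 5.0, size=n)                   # var(Y_i | X), assumed known
Y = X @ beta + rng.normal(size=n) * np.sqrt(var_y)

# Minimize sum_i (Y_i - X_i b)^2 / var(Y_i | X): scale row i by 1 / SD(Y_i),
# then run OLS on the transformed data.
w = 1.0 / np.sqrt(var_y)
beta_wls, *_ = np.linalg.lstsq(X * w[:, None], Y * w, rcond=None)

def weighted_ss(b):
    """The weighted sum of squares that beta_wls is chosen to minimize."""
    return np.sum((Y - X @ b) ** 2 / var_y)

# Unweighted OLS, for comparison.
beta_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)
```

The weights are determined by the variance of the response, which is the point of the criticism: a weight based on the precision of an explanatory variable does not come out of this criterion.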


Exercise set C 


1. Is the -0.35 in figure 2 a parameter or an estimate? How is it related to
equation (10)?

2. The correlation between mass and elite tolerance scores is 0.52; between
mass tolerance scores and repression scores, -0.26; between elite tolerance
scores and repression scores, -0.42. Compute the path coefficients
in figure 2.

Note. Exercises 2-4 can be done on a pocket calculator, but it’s easier with a
computer: see lab 9 at the back of the book, and exercise 4B14. Apparently,
Gibson used weighted regression; exercises 2-4 do not involve weights. But
see http://www.stat.berkeley.edu/users/census/repgibson.pdf.

3. Estimate the SD of $\delta$ in equation (10). You may assume the correlations
are based on 36 states but you need to decide if p is 2 or 3. (See text for
Gibson’s sample sizes.)

4. Find the SEs for the path coefficients and their difference.

5. The repression scale is lumpy: scores go from 0 to 3.5 in steps of 0.5 
(table 1 in the paper). Does this make the linearity assumption more 
plausible, or less plausible? 


6. Suppose we run a regression of Y on U and V, getting


$Y = \hat a + \hat b U + \hat c V + e$,

where e is the vector of residuals. Express the standardized coefficients
in terms of the unstandardized coefficients and the sample variances of
U, V, Y.


6.4 Inferring causation by regression 


The key to making causal inferences by regression is a response schedule. 
This is a new idea, and a complicated one. We’ll start with a mathematical 
example to illustrate the idea of a “place holder.” Logarithms can be defined 
by the equation 


(11)  $\log x = \int_1^x \frac{1}{z}\,dz$  for $0 < x < \infty$.


The symbol $\infty$ stands for “infinity.” But what does the x stand for? Not
much. It’s a place holder. You could change both x’s in (11) to u’s with- 
out changing the content, namely, the equality between the two sides of the 
equation. Similarly, z is a place holder inside the integral. You could change 
both z’s to v’s without changing the value of the integral. (Mathematicians 
refer to place holders as “dummy variables,” but statisticians use the language 
differently: section 6 below.) 

Now let’s take an example that’s closer to regression: Hooke’s law
(section 2). Suppose we’re going to hang some weights on a spring. We do
this on n occasions, indexed by $i = 1, \ldots, n$. Fix an i. If we put weight x on
the spring on occasion i, our physicist assures us that the length of the spring
will be


(12)  $Y_{i,x} = 439 + 0.05x + \epsilon_i$.


If we put a 5-unit weight on the spring, the length will be
$439 + 0.05 \times 5 + \epsilon_i = 439.25 + \epsilon_i$. If instead we put a 6-unit weight on the spring, the length will
be $439.30 + \epsilon_i$. A 1-unit increase in x makes the spring longer, by 0.05
units: causation has come into the picture. The random disturbance term $\epsilon_i$
represents measurement error. These random errors are IID for $i = 1, \ldots, n$,
with mean 0 and known variance $\sigma^2$. The units for x are kilograms; the units
for length are centimeters, so $\epsilon_i$ and $\sigma$ must be in centimeters too. (Reminder:
IID is shorthand for independent and identically distributed.)

Equation (12) looks like a regression equation, but it isn’t. It is a response
schedule that describes a theoretical relationship between weight and length.
Conceptually, x is a weight that you could hang on the spring. If you did,
equation (12) tells you what the spring would do. This is all in the subjunctive.
Formally, x is a place holder. The equation gives length $Y_{i,x}$ as a function
of weight x, with a bit of random error. For any particular i, we can choose
one x, electing to observe $Y_{i,x}$ for that x and that x only. The rest of the
response schedule (the $Y_{i,x}$ for the other x’s) would be lost to history.

Let’s make the example a notch closer to social science. We might not
know (12), but only


(13)  $Y_{i,x} = a + bx + \epsilon_i$,


where the $\epsilon_i$ are IID with mean 0 and variance $\sigma^2$. This time, a, b, and $\sigma^2$
are unknown. These parameters have to be estimated. More troublesome:
we can’t do an experiment. However, observational data are available. On
occasion i, weight $X_i$ is found on the spring; we just don’t quite know how it
got there. The length of the spring is measured as $Y_i$. We’re still in business,
if

(i) $Y_i$ was determined from the response schedule (13), so
$Y_i = Y_{i,X_i} = a + bX_i + \epsilon_i$, and

(ii) the $X_i$’s were chosen at random by Nature, independent of the $\epsilon_i$’s.

Condition (i) ties the observational data to the response schedule (13),
and gives us most of the statistical conditions we need on the random errors:
these errors are IID with mean 0 and variance $\sigma^2$. Condition (ii) is exogeneity.
Exogeneity, $X \perp\!\!\!\perp \epsilon$, is the rest of what we need. With these assumptions,
OLS gives unbiased estimates for a and b. Example 4.1 explains how to set
up the design matrix. Conditions (4.1-5) are all satisfied.

The response schedule tells us that the parameter b we’re estimating has 
a causal interpretation: if we intervene and change x to x’, then y is expected 
to change by b(x’ — x). The response schedule tells us that the relation is 
linear rather than quadratic or cubic or ... . It tells us that interventions won’t 
affect a or b. It tells us the errors are IID. It tells us there is no confounding: X 
causes Y without any help from any other variable. The exogeneity condition 
says that Nature ran the observational study just the way we would run an 
experiment. We don’t have to randomize. Nature did it for us. Nice. 

What would happen without exogeneity? Suppose Nature puts a big
weight $X_i$ on the spring whenever $\epsilon_i$ is large and positive. Nasty. Now OLS
over-estimates b. In this hypothetical, the spring doesn’t stretch as much as 
you might think. Measurement error gets mixed up with stretch. (This is 
“selection bias” or “endogeneity bias,” to be discussed in chapters 7 and 9.) 
The response schedule is a powerful assumption, and so is exogeneity. For 
Hooke’s law, the response schedule and exogeneity are reasonably convincing.
With typical social science applications, there might be some harder
questions to answer.

The discussion so far is about a one-dimensional x, but the generalization
to higher dimensions is easy. The response schedule would be


(14)  $Y_{i,x} = x\beta + \epsilon_i$,


where x is a $1 \times p$ vector of treatments and $\beta$ is a $p \times 1$ parameter vector. Again,
the errors $\epsilon_i$ are IID with mean 0 and variance $\sigma^2$. In the next section, we’ll
see that path models put together several response schedules like (14).


A response schedule says how one variable would respond, if you 
intervened and manipulated other variables. Together with the ex- 
ogeneity assumption, the response schedule is a theory of how the 
data were generated. If the theory is right, causal effects can be 
estimated from observational data by regression. If the theory is 
wrong, regression coefficients measure association not causation, 
and causal inferences can be quite misleading. 
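
The contrast between exogenous and endogenous X can be illustrated with a small simulation. This is only a sketch under stated assumptions: the parameter values (a = 0, b = 2, σ = 1), the sample size, and the rule by which Nature adds weight when the error is large are hypothetical choices made for illustration, not taken from the text.

```python
import random

def ols_simple(xs, ys):
    """Simple-regression OLS: return (intercept, slope)."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    slope = sxy / sxx
    return ybar - slope * xbar, slope

rng = random.Random(0)
a, b, sigma, n = 0.0, 2.0, 1.0, 20000   # hypothetical values

# One error per subject, as in the response schedule.
eps = [rng.gauss(0.0, sigma) for _ in range(n)]

# Exogenous case: X is drawn independently of the errors.
x_exo = [rng.uniform(0.0, 10.0) for _ in range(n)]
y_exo = [a + b * x + e for x, e in zip(x_exo, eps)]

# Endogenous case: Nature uses a bigger weight when the error is large and positive.
x_endo = [rng.uniform(0.0, 10.0) + 2.0 * e for e in eps]
y_endo = [a + b * x + e for x, e in zip(x_endo, eps)]

_, b_exo = ols_simple(x_exo, y_exo)
_, b_endo = ols_simple(x_endo, y_endo)
print(f"true b = {b}, exogenous estimate = {b_exo:.3f}, "
      f"endogenous estimate = {b_endo:.3f}")
```

In both cases the same response schedule generates the data; only the way Nature picks X differs. With exogenous X the OLS slope should land near b, while in the endogenous case it should come out too large.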


Exercise set D 


1. (This is a hypothetical; SAT originally stood for Scholastic Aptitude Test, 
widely used for college admissions in the US.) Dr. Sally Smith is doing a 
study on coaching for the Math SAT. She assumes the response schedule 
Y_{i,x} = 450 + 3x + δ_i. In this equation, Y_{i,x} is the score that subject i 
would get on the Math SAT with x hours of coaching. The error term δ_i 
is normal, with mean 0 and standard deviation 100. 


(a) If subject #77 gets 10 hours of coaching, what does Dr. Smith expect 
for this subject’s Math SAT score? 

(b) If subject #77 gets 20 hours of coaching, what does Dr. Smith expect 
for this subject’s Math SAT score? 

(c) If subject #99 gets 10 hours of coaching, what does Dr. Smith expect 
for this subject’s Math SAT score? 

(d) If subject #99 gets 20 hours of coaching, what does Dr. Smith expect 
for this subject’s Math SAT score? 


2. (This continues exercise 1; it is still a hypothetical.) After thinking 
things over, Dr. Smith still believes that the response schedule is linear: 
Y_{i,x} = a + bx + δ_i, the δ_i being IID N(0, σ²). But she decides that her 
values for a, b, and σ² are unrealistic. (They probably are.) She wants 
to estimate these parameters from data. 


(a) Does she need to do an experiment, or can she get by with an 
observational study? (The latter would be much easier to do.) 

(b) If she can use observational data, what else would she have to 
assume, beyond the response schedule? 


(c) And, how would she estimate the parameters from the observational 
data? 


6.5 Response schedules for path diagrams 


Path models are often held out as rigorous statistical engines for inferring 
causation from association. Statistical techniques can indeed be rigorous— 
given their assumptions. But the assumptions are usually imposed on the data 
by the analyst: this is not a rigorous process. The assumptions behind the 
models are of two kinds: (i) causal and (ii) statistical. This section will lay 
out the assumptions in more detail. A relatively simple path model is shown 
in figure 3, where a hypothesized causal relationship between Y and Z is 
confounded by X. 


Figure 3. Path model. The relationship between Y and Z is con- 
founded by X. Free arrows leading into Y and Z are not shown. 


[Path diagram: arrows run from X to Y, from X to Z, and from Y to Z.] 


This sort of diagram is used to draw causal conclusions from observa- 
tional data. The diagram is therefore more complicated than it looks: cau- 
sation is a complicated business. Let’s assume that Dr. Alastair Arbuthnot 
has collected data on X, Y, and Z in an observational study. He draws the 
diagram shown in figure 3, and fits the two regression equations suggested 
by the figure: 


Y = â + b̂X + error,    Z = ĉ + d̂X + êY + error 


Estimated coefficients are positive and significant. He is now trying to explain 
the findings to his colleague, Dr. Beverly Braithwaite. 

Dr. A So you see, Dr. Braithwaite, if X goes up by one unit, then Y 
goes up by b̂ units. 

Dr. B Quite. 

Dr. A Furthermore, if X goes up by one unit with Y held fixed, then Z 
goes up by d̂ units. This is the direct effect of X on Z. [“Held 
fixed” means, kept the same; the “indirect effect” is through Y.] 


Dr. B But Dr. Arbuthnot, you just told me that if X goes up by one unit, 
then Y will go up by b̂ units. 

Dr. A Moreover, if Y goes up by one unit with X held fixed, the change 
in Y makes Z go up by ê units. The effect of Y on Z is ê. 


Dr. B Dr. Arbuthnot, hello, why would Y go up unless X goes up? 
“Effects”? “Makes”? How did you get into causation?? And 
what about my first point?!? 


Dr. Arbuthnot’s explanation is not unusual. But Dr. Braithwaite has 
some good questions. Our objective in this section is to answer her, by 
developing a logically coherent set of assumptions which—if true—would 
justify Dr. Arbuthnot’s data analysis and his interpretations. On the other 
hand, as we will see, Dr. Braithwaite has good reason for her skepticism. 

At the back of his mind, Dr. Arbuthnot has two response schedules de- 
scribing hypothetical experiments. In principle, these two experiments are 
unrelated to one another. But, to model the observational study, the experi- 
ments have to be linked in a special way. We will describe the two experiments 
first, and then explain how they are put together to model Dr. Arbuthnot’s data. 

(i) First hypothetical experiment. Treatment at level x is applied to a 
subject. A response Y is observed, corresponding to the level of treatment. 
There are two parameters, a and b, that describe the response. With no 
treatment (x = 0), the response level for each subject will be a, up to random 
error. All subjects are assumed to have the same value for a. Each additional 
unit of treatment adds b to the response. Again, b is the same for all subjects 
at all levels of x, by assumption. Thus, when treatment is applied at level x, 
the response Y is assumed to be 


(15) a + bx + random error. 


For example, colleges send students with weak backgrounds to summer boot- 
camp with mathematics drill. In an evaluation study of such a program, x 
might be hours spent in math drill, and Y might be test scores. 

(ii) Second hypothetical experiment. In the second experiment, there are 
two treatments and a response variable Z. There are two treatments because 
there are two arrows leading into Z. The treatments are labeled X and Y in 
figure 3. Both treatments may be applied to a subject. In Experiment #1, Y 
was the response variable. But in Experiment #2, Y is one of the treatment 
variables: the response variable is Z. 

There are three parameters, c, d, and e. With no treatment at all (x = 
y = 0), the response level for each subject will be c, up to random error. 
Each additional unit of treatment X adds d to the response. Likewise, each 
additional unit of treatment Y adds e to the response. (Here, e is a parameter 
not a residual vector.) The constancy of parameters across subjects and levels 
of treatment is an assumption. Thus, when the treatments are applied at levels 
x and y, the response Z is assumed to be 


(16) c + dx + ey + random error. 


Three parameters are needed because it takes three parameters to specify the 
linear relationship (16), an intercept and two slopes. 

Random errors in (15) and (16) are assumed to be independent from 
subject to subject, with a distribution that is constant across subjects: the 
expectation is zero and the variance is finite. The errors in (16) are assumed 
to be independent of the errors in (15). Equations (15) and (16) are response 
schedules: they summarize Dr. Arbuthnot’s ideas about what would happen 
if he could do the experiments. 

Linking the experiments. Dr. Arbuthnot collected the data on X, Y, Z in 
an observational study. He wants to use the observational data to figure out 
what would have happened if he could have intervened and manipulated the 
variables. There is a price to be paid. 

To begin with, he has to assume the response schedules (15) and (16). 
He also has to assume that the X’s are independent of the random errors 
in the two hypothetical experiments—“exogeneity.” Thus, Dr. Arbuthnot is 
pretending that Nature randomized subjects to levels of X. If so, there is no 
need for experimental manipulation on his part, which is convenient. The 
exogeneity of X has a graphical representation: arrows come out of X in 
figure 3, but no arrows lead into X. 

Dr. Arbuthnot also has to assume that Nature generates Y from X as if 
by substituting X into (15). Then Nature generates Z as if by substituting X 
and Y—the very same X that was the input to (15) and the Y that was the 
output from (15)—into (16). Using the output from (15) as an input to (16) 
is what links the two equations together. 

Let’s take another look at this linkage. In principle, the experiments 
described by the two response schedules are separable from one another. 
There is no a priori connection between the value of x in (15) and the value 
of x in (16). There is no a priori connection between outputs from (15) and 
inputs to (16). However, to model his observational study, Dr. Arbuthnot 
links the equations “recursively.” He assumes that one value of X is chosen 
and used as an input for both equations; that the Y generated from (15) is 
used as an input to (16); and there is no feedback from (16) to (15). 

Given all these assumptions, the parameters a, b can be estimated by 
regression of Y on X. Likewise, c, d, e can be estimated by regression of 
Z on X and Y. Moreover, the regression estimates have legitimate causal 
interpretations. This is because causation is built into the response sched- 
ules (15) and (16). If causation were not assumed, causation would not be 
demonstrated by running the regressions. 
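
The recursive linkage and the two regressions can be sketched in a short simulation. Everything numerical here is an assumption for illustration: the parameter values, the normal distributions for X and the errors, and the sample size are hypothetical, and the helper `ols` is a bare-bones normal-equations solver written for this sketch, not a library routine.

```python
import random

def ols(columns, y):
    """OLS via the normal equations (X'X) beta = X'y, solved by Gauss-Jordan
    elimination. `columns` is a list of regressor columns; an intercept
    column is added in front. Returns [intercept, slope_1, slope_2, ...]."""
    n = len(y)
    cols = [[1.0] * n] + columns
    p = len(cols)
    # Augmented matrix [X'X | X'y].
    m = [[sum(u * v for u, v in zip(cols[j], cols[k])) for k in range(p)]
         + [sum(u * v for u, v in zip(cols[j], y))] for j in range(p)]
    for j in range(p):
        pivot = max(range(j, p), key=lambda r: abs(m[r][j]))
        m[j], m[pivot] = m[pivot], m[j]
        for r in range(p):
            if r != j:
                f = m[r][j] / m[j][j]
                m[r] = [u - f * v for u, v in zip(m[r], m[j])]
    return [m[j][p] / m[j][j] for j in range(p)]

rng = random.Random(1)
a, b, c, d, e = 1.0, 2.0, 3.0, 0.5, -1.5   # hypothetical parameter values
n = 5000

X = [rng.gauss(0, 1) for _ in range(n)]        # Nature picks X, independent of the errors
delta = [rng.gauss(0, 1) for _ in range(n)]    # errors for schedule (15)
eps = [rng.gauss(0, 1) for _ in range(n)]      # errors for schedule (16)

Y = [a + b * x + dl for x, dl in zip(X, delta)]              # substitute X into (15)
Z = [c + d * x + e * y + ep for x, y, ep in zip(X, Y, eps)]  # same X, and that Y, into (16)

a_hat, b_hat = ols([X], Y)            # regression of Y on X
c_hat, d_hat, e_hat = ols([X, Y], Z)  # regression of Z on X and Y
print("a, b:", round(a_hat, 2), round(b_hat, 2))
print("c, d, e:", round(c_hat, 2), round(d_hat, 2), round(e_hat, 2))
```

The estimates should come out close to the assumed parameters because the data were generated exactly as the linked response schedules prescribe. The simulation cannot show that real data are generated that way.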

One point of Dr. Arbuthnot’s regressions is to estimate the direct effect 
of X on Z. The direct effect is d in (16). If X is increased by one unit with Y 
held fixed—i.e., kept at its old value—then Z is expected to go up by d units. 
This is shorthand for the mechanism in the second experiment. The response 
schedule (16) says what happens to Z when x and y are manipulated. In 
particular, y can be held at an old value while x is made to increase. 

Dr. Arbuthnot imagines that he can keep the Y generated by Nature, 
while replacing X by X + 1. He just substitutes his values (X + 1 and Y) 
into the response schedule (16), getting 


c + d(X + 1) + eY + error = (c + dX + eY + error) + d. 


This is what Z would have been, if X had been increased by 1 unit with Y 
held fixed: Z would have been d units bigger. 

Dr. Arbuthnot also wants to estimate the effect e of Y on Z. If Y is 
increased by one unit with X held fixed, then Z is expected to go up by e 
units. Dr. Arbuthnot thinks he can keep Nature’s value for X, while replacing 
Y by Y + 1. He just substitutes X and Y + 1 into the response schedule (16), 
getting 


c + dX + e(Y + 1) + error = (c + dX + eY + error) + e. 


This is what Z would have been, if Y had been increased by 1 unit with X kept 
unchanged: Z would have been e units bigger. Of course, even Dr. Arbuthnot 
has to replace parameters by estimates. If e = 0 (or could be 0, because ê is 
statistically insignificant), then manipulating Y should not affect Z, and Y 
would not be a cause of Z after all. This is a qualitative inference. Again, the 
inference depends on the response schedule (16). 

In short, Dr. Arbuthnot uses the observational data to estimate parameters. 
But when he interprets the results (for instance, when he talks about the 
“effects” of X and Y on Z), he’s thinking about the hypothetical experiments 
described by the response schedules (15)-(16), not about the observational 
data themselves. His causal interpretations depend on a rather subtle model. 
Among other things, the same response schedules, with the same parameter 
values, must apply (i) to the hypothetical experiments and (ii) to the obser- 
vational data. In shorthand, the values of the parameters are stable under 
interventions. 

To state the model more formally, we would index the subjects by a 
subscript i in the range from 1 to n. In this notation, X_i is the value of X for 
subject i. The level of treatment #1 is denoted by x, and Y_{i,x} is the response 
for variable Y when treatment at level x is applied to subject i, as in (15). 
Similarly, Z_{i,x,y} is the response for variable Z when treatment #1 at level x 
and treatment #2 at level y are applied to subject i, as in (16). The response 
schedules are interpreted causally. 


• Y_{i,x} is what Y_i would be if X_i were set to x by intervention. 


• Z_{i,x,y} is what Z_i would be if X_i were set to x and Y_i were set to y 
by intervention. 


Figure 3 unpacks into two equations, which are more precise versions 
of (15) and (16), with subscripts for the subjects: 


(17) Y_{i,x} = a + bx + δ_i, 
(18) Z_{i,x,y} = c + dx + ey + ε_i. 


The parameters a, b, c, d, e and the error terms δ_i, ε_i are not observed. 
The parameters are assumed to be the same for all subjects. There are 
assumptions about the error terms, which form the statistical component of 
the assumptions behind the path diagram: 


(i) δ_i and ε_i are independent of each other within each subject i. 
(ii) These error terms are independent across subjects i. 
(iii) The distribution of δ_i is constant across subjects i; so is the 
distribution of ε_i. (However, δ_i and ε_i need not have the same distribution.) 
(iv) δ_i and ε_i have expectation zero and finite variance. 
(v) The X_i’s are independent of the δ_i’s and ε_i’s, where X_i is the value 
of X for subject i in the observational study. 


Assumption (v) says that Nature chooses X_i for us as if by randomization. 
In other words, the X_i’s are “exogenous.” By further assumption, Nature 
determines the response Y_i for subject i as if by substituting X_i into (17): 


Y_i = Y_{i,X_i} = a + bX_i + δ_i. 

The rest of the response schedule, Y_{i,x} for x ≠ X_i, is not observed. After 
all, even in an experiment, subject i would be assigned to one level of 
treatment. The response at other levels would not be observed. 

Similarly, we observe Z_{i,x,y} only for x = X_i and y = Y_i. The response 
for subject i is determined by Nature, as if by substituting X_i and Y_i into (18): 


Z_i = Z_{i,X_i,Y_i} = c + dX_i + eY_i + ε_i. 


The rest of the response schedule remains unobserved, namely, the responses 
Z_{i,x,y} for all the other possible values of x and y. Economists call the 
unobserved Y_{i,x} and Z_{i,x,y} potential outcomes. The model specifies unobservable 
response schedules, not just regression equations. 

The model has another feature worth noticing: each subject’s responses 
are determined by the levels of treatment for that subject only. Treatments 
applied to subject j do not affect the responses of subject i. For treating 
infectious diseases, this is not such a good model. (If one subject sneezes, 
another will catch the flu: stop the first sneeze, prevent the second flu.) There 
may be similar problems with social experiments, when subjects interact with 
each other. 


Figure 4. The path diagram as a box model. 


[Box model: X is drawn from the X-box; Y = a + b·X + a draw from the δ-box; 
Z = c + d·X + e·Y + a draw from the ε-box.] 


The box model in figure 4 illustrates the statistical assumptions. Inde- 
pendent random errors with constant distributions are represented as draws 
made at random with replacement from a box of potential errors (Freedman- 
Pisani-Purves 2007). Since the box remains the same from one draw to 
another, the probability distribution of one draw is the same as the distribu- 
tion of any other. The distribution is constant. Furthermore, the outcome of 
one draw cannot affect the distribution of another. That is independence. 

Figure 4 also shows how the two hypothetical causal mechanisms, response 
schedules (17) and (18), are linked together to model the observational 
data. Let’s take this apart and put it back together. We can think about 
each response schedule as a little machine, which accepts inputs and makes 
output. There are two of these machines at work. 

• First causal mechanism. You feed an x (any x that you like) into 
machine #1. The output from the machine is Y = a + bx, plus a random 
draw from the δ-box. 


• Second causal mechanism. You feed x and y (any x and y that you 
like) into machine #2. The output from the machine is Z = c + dx + ey, 
plus a random draw from the ε-box. 


• Linkage. You don’t feed anything into anything. Nature chooses X at 
random from the X-box, independent of the δ’s and ε’s. She puts X into 
machine #1, to generate a Y. She puts the same X, and the Y she just 
generated, into machine #2, to generate Z. You get to see (X, Y, Z) for 
each subject. This is Dr. Arbuthnot’s model for his observational data. 


• Estimation. You estimate a, b, c, d, e by OLS, from the observational 
data, namely, triples of observed values on (X, Y, Z) for many subjects. 


• Causal inference. You can say what would happen if you could get your 
hands on the machines and put an x into machine #1. You can also say 
what would happen if you could put x and y into machine #2. 


You never do touch the machines. (After all, these are purely theoretical 
entities.) Still, you seem to be free to use your own x’s and y’s, rather than the 
ones generated by Nature, as inputs. You can say what the machines would 
do if you chose the inputs. That is causal inference from observational data. 
Causal inference is legitimate because—by assumption—you know the social 
physics: response schedules (17) and (18). 
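
The two machines can be written as functions, which makes the difference between Nature's inputs and your own inputs concrete. This is a sketch under assumptions: the parameter values, the five subjects, and the function names `machine1` and `machine2` are hypothetical, and the parameters are treated as known, which they never are in practice.

```python
import random

rng = random.Random(2)
a, b, c, d, e = 1.0, 2.0, 3.0, 0.5, -1.5   # hypothetical parameter values
n = 5

# Each subject has its own error draws; they stay fixed under intervention.
delta = [rng.gauss(0, 1) for _ in range(n)]   # draws from the delta-box
eps = [rng.gauss(0, 1) for _ in range(n)]     # draws from the epsilon-box

def machine1(i, x):
    """Response schedule (17): what Y would be for subject i at level x."""
    return a + b * x + delta[i]

def machine2(i, x, y):
    """Response schedule (18): what Z would be for subject i at levels x and y."""
    return c + d * x + e * y + eps[i]

# Linkage: Nature picks X, feeds it to machine #1, then feeds the same X
# and the resulting Y to machine #2.
X = [rng.uniform(0, 10) for _ in range(n)]
Y = [machine1(i, X[i]) for i in range(n)]
Z = [machine2(i, X[i], Y[i]) for i in range(n)]

# Hypothetical interventions: use our own inputs instead of Nature's.
for i in range(n):
    dz_x = machine2(i, X[i] + 1, Y[i]) - Z[i]   # X up by one unit, Y held fixed
    dz_y = machine2(i, X[i], Y[i] + 1) - Z[i]   # Y up by one unit, X held fixed
    print(i, round(dz_x, 6), round(dz_y, 6))
```

For every subject the first difference equals d and the second equals e, up to floating-point rounding, because the error draw is the same before and after the intervention. That is the content of the substitution argument; it holds by construction of the machines, not because of anything observed in data.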

What about the assumptions? Checking (17) and (18), which involve 
potential outcomes, is going to be hard work. Checking the statistical as- 
sumptions will not be much easier. The usual point of running regressions is 
to make causal inferences without doing real experiments. On the other hand, 
without the real experiments, the assumptions behind the models are going 
to be iffy. Inferences get made by ignoring the iffiness of the assumptions. 
That is the paradox of causal inference by regression, and a good reason for 
Dr. Braithwaite’s skepticism. 

Path models do not infer causation from association. Instead, path mod- 
els assume causation through response schedules, and—using additional sta- 
tistical assumptions—estimate causal effects from observational data. The 
statistical assumptions (independence, expectation zero, constant variance) 
justify estimating coefficients by ordinary least squares. With large samples, 
standard errors, confidence intervals, and significance tests would follow. 
With small samples, the errors would have to follow a normal distribution in 
order to justify t-tests. 

Evaluating the statistical models in chapters 1–6. Earlier in the book, 
we discussed several examples of causal inference by regression—Yule on 
poverty, Blau and Duncan on stratification, Gibson on McCarthyism. We 
found serious problems. These studies are among the strongest in the social 
sciences, in terms of clarity, interest, and data analysis. (Gibson, for example, 
won a prize for best paper of the year—and is still viewed as a landmark study 
in political behavior.) The problems are built into the assumptions behind the 
statistical models. 


Typically, a regression model assumes causation and uses the data 
to estimate the size of a causal effect. If the estimate isn’t sta- 
tistically significant, lack of causation is inferred. Estimation and 
significance testing require statistical assumptions. Therefore, you 
need to think about the assumptions—both causal and statistical— 
behind the models. If the assumptions don’t hold, the conclusions 
don’t follow from the statistics. 


Selection vs intervention 


The conditional expectation of Y given X = x is the average of Y for 
subjects with X = x. (We ignore sampling error for now.) The response- 
schedule formalism connects two very different ideas of conditional expec- 
tation: (i) selecting the subjects with X = x, versus (ii) intervening to set 
X = x. The first is something you can actually do with observational data. 
The second would require manipulation. Response schedules crystallize the 
assumptions you need to get from selection to intervention. (Intervention 
means interrupting the natural flow of events in order to manipulate a vari- 
able, as in an experiment; the contrast is with passive observation.) 


Selection is one thing, intervention is another. 
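
A simulation can make the gap between selection and intervention visible. The sketch below assumes a response schedule of the form a + bx + δ_i with a binary treatment; the parameter values, the sample size, and the endogenous assignment rule (Nature treats exactly the subjects with positive errors) are hypothetical choices made for illustration.

```python
import random

rng = random.Random(3)
a, b, n = 10.0, 2.0, 20000   # hypothetical values
delta = [rng.gauss(0, 1) for _ in range(n)]

def response(i, x):
    """Response schedule: what Y would be for subject i if X were set to x."""
    return a + b * x + delta[i]

def avg(values):
    values = list(values)
    return sum(values) / len(values)

# Intervention: set X = 1 for every subject and average the responses.
intervention = avg(response(i, 1) for i in range(n))

# Selection with exogenous X: Nature assigns X by coin toss, independent of delta.
X_exo = [rng.randint(0, 1) for _ in range(n)]
sel_exo = avg(response(i, X_exo[i]) for i in range(n) if X_exo[i] == 1)

# Selection with endogenous X: Nature gives X = 1 to subjects with positive errors.
X_endo = [1 if delta[i] > 0 else 0 for i in range(n)]
sel_endo = avg(response(i, X_endo[i]) for i in range(n) if X_endo[i] == 1)

print(f"intervene, set X = 1:        {intervention:.2f}")
print(f"select X = 1, exogenous X:   {sel_exo:.2f}")
print(f"select X = 1, endogenous X:  {sel_endo:.2f}")
```

With exogenous X, the average of Y among subjects selected for X = 1 should be close to the average under intervention. With the endogenous rule, the selected subjects are systematically different, so the two averages should separate.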


Structural equations and stable parameters 


In econometrics, “structural” equations describe causal relationships. 
Response schedules give a clearer meaning to this idea, and to the idea of 
“stability under intervention.” The parameters in a path diagram, for instance, 
are defined through response schedules like (17) and (18), separately from 
the data. By assumption, these parameters are constant across (i) subjects 
and (ii) levels of treatment. Moreover, (iii) the parameters stay the same 
whether you intervene or just observe the natural course of events. Response 
schedules bundle up these assumptions for us, along with similar assumptions 
on the error distributions. Assumption (iii) is sometimes called “constancy” 
or “invariance” or “stability under intervention.” 


Regression equations are structural, with parameters that are sta- 
ble under intervention, when the equations derive from response 
schedules. 


Ambiguity in notation 


Look back at figure 3. In the observational study, there is an X_i for 
each subject i. In some contexts, X just means the X_i for a generic subject. 
In other contexts, X is the vector whose ith component is X_i. Often, X is 
the design matrix. This sort of ambiguity is commonplace. You have to pay 
attention to context, and figure out what is meant each time. 


Exercise set E 


1. In the path diagram below, free arrows are omitted. How many free ar- 
rows should there be, where do they go, and what do they mean? What 
does the curved line mean? The diagram represents some regression 
equations. What are the equations? the parameters? State the assump- 
tions that would be needed to estimate the parameters by OLS. What 
data would you need? What additional assumptions would be needed to 
make causal inferences? Give an example of a qualitative causal infer- 
ence that could be made from one of the equations. Give an example of 
a quantitative causal inference. 


[Path diagram: variables include U, V, and Y; the figure itself is not reproduced here.] 


2. With the assumptions of this section, show that a regression of Y_i on X_i 
gives unbiased estimates, conditionally on the X_i’s, of a and b in (17). 
Show also that a regression of Z_i on X_i and Y_i gives unbiased estimates, 
conditionally on the X_i’s and Y_i’s, of c, d, and e in (18). Hints. What are 
the design matrices in the two regressions? Can you verify assumptions 
(4.2)-(4.5)? [Cross-references: (4.2) is equation (2) in chapter 4.] 


3. Suppose you are only interested in the effects of X and Y on Z; you are 
not interested in the effect of X on Y. You are willing to assume the 
response schedule (18), with IID errors ε_i, independent of the X_i’s and 
Y_i’s. How would you estimate c, d, e? Do the estimates have a causal 
interpretation? Why? 


4. True or false, and explain. 
(a) In figure 1, father’s education has a direct influence on son’s occu- 
pation. 
(b) In figure 1, father’s education has an indirect influence on son’s 
occupation through son’s education. 
(c) In exercise 1, U has a direct influence on Y. 
(d) In exercise 1, V has a direct influence on Y. 


5. Suppose Dr. Arbuthnot’s models are correct; and in his data, X_77 = 
12, Y_77 = 2, Z_77 = 29. 
(a) How much bigger would Y_77 have been, if Dr. Arbuthnot had intervened, 
setting X_77 to 13? 
(b) How much bigger would Z_77 have been, if Dr. Arbuthnot had intervened, 
setting X_77 to 13 and Y_77 to 5? 


6. An investigator writes, “Statistical tests are a powerful tool for deciding 
whether effects are large.” Do you agree or disagree? Discuss briefly. 


6.6 Dummy variables 


A “dummy variable” takes the value 0 or 1. Dummy variables are used 
to represent the effects of qualitative factors in a regression equation. Some- 
times, dummies are even used to represent quantitative factors, in order to 
weaken linearity assumptions. (Dummy variables are also called “indicator” 
variables or “binary” variables; programmers call them “flags.”) 


Example. A company is accused of discriminating against female em- 
ployees in determining salaries. The company counters that male employees 
have more job experience, which explains the salary differential. To explore 
that idea, a statistician might fit the equation 


Y = a + b MAN + c EXPERIENCE + error. 

Here, MAN is a dummy variable, taking the value 1 for men and 0 for women. 
EXPERIENCE would be years of job experience. A significant positive value 
for b would be taken as evidence of discrimination. 

Objections could be raised to the analysis. For instance, why does EX- 
PERIENCE have a linear effect? To meet that objection, some analysts would 
put in a quadratic term: 


Y = a + b MAN + c EXPERIENCE + d EXPERIENCE² + error. 

Others would break up EXPERIENCE into categories, e.g., 


category 1 under 5 years 
category 2 5-10 years (inclusive) 
category 3 over 10 years 


Then dummies for the first two categories could go into the equation: 

Y = a + b MAN + c_1 CAT_1 + c_2 CAT_2 + error. 


For example, CAT_1 is 1 for all employees who have less than 5 years of 
experience, and 0 for the others. Don’t put in all three dummies: if you do, 
the design matrix won’t have full rank. 
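
The rank problem can be checked directly. The sketch below builds a design matrix for six made-up employees (the data and the helper names `rank`, `category`, and `row` are hypothetical) and computes the rank by exact Gaussian elimination, once with two category dummies and once with all three.

```python
from fractions import Fraction

def rank(rows):
    """Rank of a matrix (list of rows), by Gaussian elimination in exact arithmetic."""
    m = [[Fraction(v) for v in r] for r in rows]
    r = 0
    for col in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            f = m[i][col] / m[r][col]
            m[i] = [u - f * v for u, v in zip(m[i], m[r])]
        r += 1
    return r

def category(years):
    """Experience category: 1 = under 5 years, 2 = 5-10 years inclusive, 3 = over 10."""
    if years < 5:
        return 1
    if years <= 10:
        return 2
    return 3

def row(man, years, n_dummies):
    """One row of the design matrix: intercept, MAN, then the category dummies."""
    cat = category(years)
    return [1, man] + [1 if cat == k else 0 for k in range(1, n_dummies + 1)]

# Hypothetical employees: (MAN, years of experience).
employees = [(1, 2), (0, 3), (1, 7), (0, 10), (1, 12), (0, 15)]

two = [row(m, y, 2) for m, y in employees]     # intercept, MAN, CAT_1, CAT_2
three = [row(m, y, 3) for m, y in employees]   # same, plus a dummy for category 3

print("two dummies:   rank", rank(two), "with", len(two[0]), "columns")
print("three dummies: rank", rank(three), "with", len(three[0]), "columns")
```

With all three dummies, the category columns add up to the intercept column, so the five columns are linearly dependent and the rank stays at 4.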

The coefficients are a little tricky to interpret. You have to look for 
the missing category, because effects are measured relative to the missing 
category. For MAN, it’s easy. The baseline is women. The equation says 
that men earn b more than women, other things equal (experience). For 
CAT_1, it’s less obvious. The baseline is the third category, over 10 years of 
experience. The equation says that employees in category 1 earn c_1 more 
than employees in category 3. Furthermore, employees in category 2 earn c_2 
more than employees in category 3. 

We expect c_1 and c_2 to be negative, because long-term employees get 
higher salaries. Similarly, we expect c_1 < c_2. Other things are held equal 
in these comparisons, namely, gender. (Saying that Harriet earns −$5,000 
more than Harry is a little perverse; ordinarily, we would talk about earning 
$5,000 less: but this is statistics.) 
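
The baseline comparisons can be spelled out by writing the expected salary in each cell, assuming the equation with the two category dummies holds and the error has expectation zero. The display only restates that equation; it is not an additional result.

```latex
\begin{aligned}
E[Y \mid \text{woman, category 3}] &= a, \\
E[Y \mid \text{woman, category 1}] &= a + c_1, \\
E[Y \mid \text{woman, category 2}] &= a + c_2, \\
E[Y \mid \text{man, category } k] &= E[Y \mid \text{woman, category } k] + b,
  \qquad k = 1, 2, 3.
\end{aligned}
```

Each coefficient is a difference from the omitted cell: c_1 and c_2 compare categories 1 and 2 with category 3 at fixed gender, and b compares men with women at fixed category.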

Of course, the argument would continue. Why these categories? What 
about other variables? If people compete with each other for promotion, how 
can error terms be independent? And so forth. The point here was just to 
introduce the idea of dummy variables. 


Types of variables 


A qualitative or categorical variable is not numerical. Examples include 
gender and marital status, values for the latter being never-married, married, 
widowed, divorced, separated. By contrast, a quantitative variable takes 
numerical values. If the possible values are few and relatively widely separated, 
the variable is discrete; otherwise, continuous. These are useful distinctions, 
but the boundaries are a little blurry. A dummy variable, for instance, can 
be seen as converting a categorical variable with two values into a numerical 
variable taking the values 0 and 1. 


6.7 Discussion questions 
Some of these questions cover material from previous chapters. 


1. A regression of wife’s educational level (years of schooling) on hus- 
band’s educational level gives the equation 


WifeEdLevel = 5.60 + 0.57 × HusbandEdLevel + residual. 


(Data are from the Current Population Survey in 2001.) If Mr. Wang’s 
company sends him back to school for a year to catch up on the latest 
developments in his field, do you expect Mrs. Wang’s educational level 
to go up by 0.57 years? If not, what does the 0.57 mean? 


2. In equation (10), δ is a random error; there is a δ for each state. Gibson 
finds that β̂_1 is statistically insignificant, while β̂_2 is highly significant 
(two-tailed). Suppose that Gibson computed his P-values from the standard 
normal curve; the area under the curve between −2.58 and +2.58 
is 0.99. True or false and explain: 


(a) The absolute value of β̂_2 is more than 2.6 times its standard error. 


(b) The statistical model assumes that the random errors are indepen- 
dent across states. 


(c) However, the estimated standard errors are computed from the data. 


(d) The computation in (c) can be done whether or not the random errors 
are independent across states: the computation uses the tolerance 
scores and repression scores, but does not use the random errors 
themselves. 


(e) Therefore, Gibson’s significance tests are fine, even if the random 
errors are dependent across states. 


3. Timberlake and Williams (1984) offer a regression model to explain 
political oppression (PO) in terms of foreign investment (FI), energy 
development (EN), and civil liberties (CV). High values of PO corre- 
spond to authoritarian regimes that exclude most citizens from political 
participation. High values of CV indicate few civil liberties. Data were 
collected for 72 countries. The equation proposed by Timberlake and 
Williams is 


PO = a + b FI + c EN + d CV + random error, 


with the usual assumptions about the random errors. The estimated 
coefficient b̂ of FI is significantly positive, and is interpreted as measuring 
the effect of foreign investment on political oppression. 


(a) There is one random error for each ______ so there are ______ 
random errors in all. Fill in the blanks. 


(b) What are the “usual assumptions” on the random errors? 


(c) From the data in the table below, can you estimate the coefficient 
a in the equation? If so, how? If not, why not? What about b? 


(d) How can b̂ be positive, given that r(FI, PO) is negative? 

(e) From the data in the table, can you tell whether b̂ is significantly 
different from 0? If so, how? If not, why not? 


(f) Comment briefly on the statistical logic used by Timberlake and 
Williams. Do you agree that foreign investment causes political 
oppression? You might consider the following points. (i) Does CV 
belong on the right hand side of the equation? (ii) If not, and you 
drop it, what happens? (iii) What happens if you run a regression 
of CV on PO, FI, and EN? 


The Timberlake and Williams data. 72 countries. Correlation matrix 
for political oppression (PO), foreign investment (FI), energy 
development (EN), and civil liberties (CV). 

        PO       FI       EN       CV 
PO    1.000    −.175    −.480    +.868 
FI    −.175    1.000    +.330    −.391 
EN    −.480    +.330    1.000    −.430 
CV    +.868    −.391    −.430    1.000 


Note. Regressions can be done with a pocket calculator, but it’s easier 
with a computer. We’re using different notation from the paper. 


4. Alba and Logan (1993) develop a regression model to explain residential 
integration. The equation is Y_i = X_i β + δ_i, where i indexes individuals 
and X_i is a vector of three dozen dummy variables describing various 
characteristics of subject i, including: 

AGE GROUP           under 5, 5–17, ... 
HOUSEHOLD TYPE      married couple, ... 
INCOME LEVEL        under $5,000, $5,000–$10,000, ... 
EDUCATIONAL LEVEL   grammar school, some high school, ... 


The parameter vector β is taken as constant across subjects within each 
of four demographic groups (Asians, Hispanics, non-Hispanic blacks, 
non-Hispanic whites). The dependent variable Y; is the percentage of 
non-Hispanic whites in the town where subject i resides, and is the same 
for all subjects in that town. Four equations are estimated, one for each 
of the demographic groups. Estimation is by OLS, with 1980 census 
data on 674 suburban towns in the New York metropolitan area. The R²’s 
range from 0.04 to 0.29. Some coefficients are statistically significant 
for certain groups but not others, which is viewed as evidence favoring 
one theory of residential integration rather than another. Do the OLS 
assumptions apply? If not, how would this affect statistical significance? 
Discuss briefly. 


5. Rodgers and Maranto (1989) developed a model for 


“the complex causal processes involved. . . . [in] the determinants of 
publishing success. . . . the good news is that academic psychologists 
need not attend a prestigious graduate program to become a produc- 
tive researcher. ... the bad news is that attending a nonprestigious 
PhD program remains an impediment to publishing success.” 


The Rodgers-Maranto model (figure 7 in the paper) is shown in the 
diagram below. 


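[Path diagram of the Rodgers-Maranto model; the arrows and path coefficients
are not recoverable here. Surviving node labels: ABILITY, PREPROD, SEX.]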


The investigators sent questionnaires to a probability sample of 932
members of the American Psychological Association who were currently
working as academic psychologists, and obtained data on 241 men and
244 women. Cases with missing data were deleted, leaving 86 men and
76 women. Variables include:

    SEX       respondent’s gender (a dummy variable).
    ABILITY   measures selectivity of respondent’s undergraduate institution,
              respondent’s membership in Phi Beta Kappa, etc.
    GPQ       measures the quality of respondent’s graduate institution,
              using national rankings, publication rates of faculty, etc.
    QFJ       measures quality of respondent’s first job.
    PREPROD   respondent’s quality-weighted number of publications
              before the PhD. (Mean is 0.8, SD is 1.6.)
    PUBS      number of respondent’s publications within 6 years after
              the PhD. (Mean is 7, SD is 6.)
    CITES     number of times PUBS were cited by others. (Mean is
              20, SD is 44.)


Variables were standardized before proceeding with the analysis. Six 
models were developed but considered inferior to the model shown in 
the diagram. What does the diagram mean? What are the numbers on 
the arrows? Where do you see the good news/bad news? Do you believe 
the news? Discuss briefly. 


6. A balance gives quite precise measurements for the difference between
weights that are nearly equal. A, B, C, D each weigh about 1 kilogram.
The weight of A is known exactly: it is 53 μg above a kilogram, where
a μg is a millionth of a gram. (A kilogram is 1000 grams.) The weights
of B, C, D are determined through a “weighing design” that involves 6
comparisons shown in the table below.

    Comparison                 Difference in μg
    A and B  vs  C and D            +42
    A and C  vs  B and D            -12
    A and D  vs  B and C            +10
    B and C  vs  A and D            -65
    B and D  vs  A and C            -17
    C and D  vs  A and B            +11

According to the first line in the table, for instance, A and B are put on
the left hand pan of the balance; C and D on the right hand pan. The
difference in weights (left hand pan minus right hand pan) is 42 μg.


(a) Are these data consistent or inconsistent?
(b) What might account for the inconsistencies?
(c) How would you estimate the weights of B, C, and D?
(d) Can you put standard errors on the estimates?
(e) What assumptions are you making?


Explain your answers. 


7. (Hard.) There is a population of \(N\) subjects, indexed by \(i = 1, \ldots, N\).
Associated with subject \(i\) there is a number \(v_i\). A sample of size \(n\) is
chosen at random without replacement.

(a) Show that the sample average of the \(v\)’s is an unbiased estimate of
the population average. (There are hints below.)

(b) If the sample \(v\)’s are denoted \(V_1, V_2, \ldots, V_n\), show that the
probability distribution of \(V_2, V_1, \ldots, V_n\) is the same as the probability
distribution of \(V_1, V_2, \ldots, V_n\). In fact, the probability distribution
of any permutation of the \(V\)’s is the same as any other: the sample
is exchangeable.

Hints. If you’re starting from scratch, it might be easier to do part (b)
first. For (b), a permutation \(\pi\) of \(\{1, \ldots, N\}\) is a 1-1 mapping of this set
onto itself. There are \(N!\) permutations. You can choose a sample of size
\(n\) by choosing \(\pi\) at random, and taking the subjects with index numbers
\(\pi(1), \ldots, \pi(n)\) as the sample.


8. There is a population of \(N\) subjects, indexed by \(i = 1, \ldots, N\). A
treatment \(x\) can be applied at level 0, 10, or 50. Each subject will be
assigned to treatment at one of these levels. Subject \(i\) has a response
\(y_{i,x}\) if assigned to treatment at level \(x\). For instance, with a drug to
reduce cholesterol levels, \(x\) would be the dose and \(y\) the cholesterol level
at the end of the experiment. Note: \(y_{i,x}\) is fixed, not random.

Each subject \(i\) has a \(1 \times p\) vector of personal characteristics \(w_i\),
unaffected by assignment. In the cholesterol experiment, these characteristics
might include weight and cholesterol level just before the experiment starts.
If you assign subject \(i\) to treatment at level 10, say, you observe \(y_{i,10}\)
but not \(y_{i,0}\) or \(y_{i,50}\). You can always observe \(w_i\). Population
parameters of interest are

\[
\alpha_0 = \frac{1}{N}\sum_{i=1}^{N} y_{i,0}, \qquad
\alpha_{10} = \frac{1}{N}\sum_{i=1}^{N} y_{i,10}, \qquad
\alpha_{50} = \frac{1}{N}\sum_{i=1}^{N} y_{i,50}.
\]




The parameter \(\alpha_0\) is the average result we would see if all subjects were
put into treatment at level 0. We could measure this directly, by assigning
all the subjects to treatment at level 0, but would then lose our chance to
learn about the other parameters.

Suppose \(n_0, n_1, n_2\) are positive numbers whose sum is \(N\). In a “randomized
controlled experiment,” \(n_0\) subjects are chosen at random without
replacement and assigned to treatment at level 0. Then \(n_1\) subjects are
chosen at random without replacement from the remaining subjects and
assigned to treatment at level 10. The last \(n_2\) subjects are assigned to
treatment at level 50. From the experimental data:


(a) Can you estimate the three population parameters of interest? 


(b) Can you estimate the average response if all the subjects had been 
assigned to treatment at level 75? 


Explain briefly. 


9. (This continues question 8.) Let \(X_i = x\) if subject \(i\) is assigned to
treatment at level \(x\). A simple regression model says that given the
assignments, the response \(Y_i\) of subject \(i\) is \(\alpha + X_i\beta + \epsilon_i\), where
\(\alpha, \beta\) are scalar parameters and the \(\epsilon_i\) are IID with mean 0 and
variance \(\sigma^2\). Does randomization justify the model? If the model is true,
can you estimate the average response if all the subjects had been assigned
to treatment at level 75? Explain.


10. (This continues questions 8 and 9.) Let \(Y_i\) be the response of subject
\(i\). According to a multiple regression model, given the assignments,
\(Y_i = \alpha + X_i\beta + w_i\gamma + \epsilon_i\), where \(w_i\) is a vector of personal
characteristics for subject \(i\) (question 8); \(\alpha, \beta\) are scalar parameters,
\(\gamma\) is a vector of parameters, and the \(\epsilon_i\) are IID with mean 0 and
variance \(\sigma^2\). Does randomization justify the model? If the model is true,
can you estimate the response if a subject with characteristics \(w_i\) is
assigned to treatment at level 75? Explain.


11. Suppose \((X_i, \epsilon_i)\) are IID as pairs for \(i = 1, \ldots, n\), with
\(E(\epsilon_i) = 0\) and \(\mathrm{var}(\epsilon_i) = \sigma^2\). Here \(X_i\) is a \(1 \times p\) random vector
and \(\epsilon_i\) is a random variable (unobservable). Suppose \(E(X_i'X_i)\) is
\(p \times p\) positive definite. Finally, \(Y_i = X_i\beta + \epsilon_i\) where \(\beta\) is a
\(p \times 1\) vector of unknown parameters. Is OLS biased or unbiased? Explain.


12. To demonstrate causation, investigators have used (i) natural experiments,
(ii) randomized controlled experiments, and (iii) regression models,
among other methods. What are the strengths and weaknesses of
methods (i), (ii), and (iii)? Discuss, preferably giving examples to
illustrate your points.


13. True or false, and explain: if the OLS assumptions are wrong, the
computer can’t fit the model to data.


14. An investigator fits the linear model \(Y = X\beta + \epsilon\). The OLS estimate
for \(\beta\) is \(\hat\beta\), and the fitted values are \(\hat Y\). The investigator writes
down the equation \(\hat Y = X\hat\beta + \hat\epsilon\). What is \(\hat\epsilon\)?

15. Suppose the \(X_i\) are IID \(N(0, 1)\). Let \(\epsilon_i = 0.025(X_i^4 - 3X_i^2)\) and
\(Y_i = X_i + \epsilon_i\). An investigator does not know how the data were generated,
and runs a regression of \(Y\) on \(X\).

(a) Show that \(R^2\) is about 0.97. (This is hard.)
(b) Do the OLS assumptions hold?
(c) Should the investigator trust the usual regression formulas for
standard errors?

Hints. Part (a) can be done by calculus (see the end notes to chapter 5
for the moments of the normal distribution), but it gets a little intricate.
A computer simulation may be easier. Assume there is a large sample,
e.g., \(n = 500\).
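
The simulation mentioned in the hints might look something like the sketch
below. It is only an illustration in Python: the function name simulate_r2,
the seeds, and the larger sample size used for comparison are assumptions,
not part of the question. The code draws the \(X_i\), builds \(\epsilon_i\) and
\(Y_i\) as defined above, and computes \(R^2\) for the regression of \(Y\) on \(X\)
with an intercept.

```python
import random


def simulate_r2(n=500, seed=0):
    """R^2 from regressing Y on X (with intercept), where X ~ N(0, 1),
    eps = 0.025 * (X**4 - 3 * X**2), and Y = X + eps."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [x + 0.025 * (x**4 - 3 * x**2) for x in xs]
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    # With one explanatory variable and an intercept,
    # R^2 is the squared sample correlation between X and Y.
    return sxy**2 / (sxx * syy)


if __name__ == "__main__":
    print(round(simulate_r2(), 3))             # n = 500, as in the hint
    print(round(simulate_r2(n=50_000), 3))     # a bigger sample, for comparison
```

Results at n = 500 bounce around from seed to seed, because \(\epsilon_i\) has
long tails; a bigger sample settles down. The simulation says nothing about
parts (b) and (c), which still need an argument.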


16. Assume the response schedule \(Y_{i,x} = a + bx + \epsilon_i\). The \(\epsilon_i\) are IID
\(N(0, \sigma^2)\). The variables \(X_i\) are IID \(N(0, \tau^2)\). In fact, the pairs
\((\epsilon_i, X_i)\) are IID in \(i\), and jointly normal. However, the correlation
between \((\epsilon_i, X_i)\) is \(\rho\), which may not be 0. The parameters
\(a, b, \sigma^2, \tau^2, \rho\) are unknown. You observe \(X_i\) and \(Y_i = Y_{i,X_i}\) for
\(i = 1, \ldots, 500\).

(a) If you run a regression of \(Y_i\) on \(X_i\), will you get unbiased estimates
for \(a\) and \(b\)?

(b) Is the relationship between \(X\) and \(Y\) causal?

Explain briefly.


17. A statistician fits a regression model (\(n = 107\), \(p = 6\)) and tests whether
the coefficient she cares about is 0. Choose one or more of the options
below. Explain briefly.

(i) The null hypothesis says that \(\beta_2 = 0\).
(ii) The null hypothesis says that \(\hat\beta_2 = 0\).
(iii) The null hypothesis says that \(t = 0\).
(iv) The alternative hypothesis says that \(\beta_2 \ne 0\).
(v) The alternative hypothesis says that \(\hat\beta_2 \ne 0\).
(vi) The alternative hypothesis says that \(t \ne 0\).
(vii) The alternative hypothesis says that \(\hat\beta_2\) is statistically significant.


18. Doctors often use body mass index (BMI) to measure obesity. BMI is
weight/height\(^2\), where weight is measured in kilograms and height in
meters. A BMI of 30 is getting up there. For American women age 18-24,
the mean BMI is 24.6 and the variance is 29.4. Although the BMI
for a typical woman in this group is something like ______, the BMI
of a typical woman will deviate from that central value by something
like ______. Fill in the blanks; explain briefly.


19. An epidemiologist says that “randomization does not exclude confounding...
confounding is very likely if information is collected, as it should
be, on a sufficient number of baseline characteristics. . . .” Do you agree
or disagree? Discuss briefly.


Notes. “Baseline characteristics” are characteristics of subjects mea- 
sured at the beginning of the study, i.e., just before randomization. The 
quote, slightly edited, is from Victora et al (2004). 


20. A political scientist is studying a regression model with the usual
assumptions, including IID errors. The design matrix \(X\) is fixed, with full
rank \(p = 5\), and \(n = 57\). The chief parameter of interest is \(\beta_2 - \beta_4\). One
possible estimator is \(\hat\beta_2 - \hat\beta_4\), where \(\hat\beta = (X'X)^{-1}X'Y\). Is there
another linear unbiased estimator with smaller variance? Explain briefly.


6.8 End notes for chapter 6 


Discussion questions. In question 5, some details of the data analy- 
sis are omitted. Question 6 is hypothetical. Two references on weighing 
designs are Banerjee (1975) and Cameron et al (1977); these are fairly tech- 
nical. Question 16 illustrates endogeneity bias. Background for question 
19: epidemiologists like to adjust for imbalance in baseline characteristics 
by statistical modeling, on the theory that they’re getting more power—as 
they would, if their models were right. 


Measurement error. This is an important topic, not covered in the text. 
In brief, random error in Y can be incorporated into \(\epsilon\), as in the example on
Hooke’s law. Random error in X usually biases the coefficient estimates. The 
bias can go either way. For example, random error in a confounder can make 
an estimated effect too big; random error in measurements of a putative cause 
can dilute the effect. Biased measurements of X or Y create other problems. 
There are ways to model the impact of errors, both random and systematic. 
Such correctives would be useful if the supplementary models were good
approximations. Arguments get very complicated very quickly, and benefits
remain doubtful (Freedman 1987, 2005). Adcock and Collier (2001) have a 
broader discussion of measurement issues in the social sciences. 


Dummy variable. The term starts popping up in the statistical literature 
around 1950: see Oakland (1950) or Klein (1951). The origins are unclear, 
but the Oxford English Dictionary notes related usage in computer science 
around 1948. 


Current Population Survey. This survey is run by the US Bureau of 
the Census for the Bureau of Labor Statistics, and is the principal source 
of employment data in the US. There are supplementary questionnaires on 
other topics of interest, including computer use, demographics, and electoral 
participation. For information on the design of the survey, see Freedman- 
Pisani-Purves (2007, chapter 22). 


Path diagrams. The choice of variables and arrows in a path diagram is 
up to the analyst, as are the directions in which the arrows point, although some 
choices may fit the data less well, and some choices may be illogical. If the 
graph is “complete”—every pair of nodes joined by an arrow—the direction 
of the arrows is not constrained by the data (Freedman 1997, pp. 138, 142). 
Ordering the variables in time may reduce the number of options. There are 
some algorithms that claim to be able to induce the path diagram from the 
data, but the track record is not good (Freedman 1997, 2004; Humphreys and 
Freedman 1996, 1999). Achen (1977) is critical of standardization; also see 
Blalock (1989). Pearl (1995) discusses direct and indirect effects. 


Response schedules provide a rationale for the usual statistical analysis 
of path diagrams, and there seems to be no alternative that is much simpler. 
The statistical assumptions can be weakened a little; see, e.g., (5.2). Figure 4 
suggests that the X’s are IID. This is the best case for path diagrams, especially 
when variables are standardized, but all that is needed is exogeneity. Setting 
up parameters when non-IID data are standardized is a little tricky; see, e.g., 


http://www.stat.berkeley.edu/users/census/standard.pdf 


The phrase “response schedule” combines “response surface” from statis- 
tics with “supply and demand schedules” from economics (chapter 9). One 
of the first papers to mention response schedules is Bernheim, Shleifer, and 
Summers (1985, p. 1051). Some economists have started to write “supply 
response schedule” and “demand response schedule.” 


Invariance. The discussion in sections 4-5 assumes that errors are invariant
under intervention. It might make more sense to assume that the error
distributions are invariant, rather than the errors themselves (Freedman 2004).


Ideas of causation. Embedded in the response-schedule formalism is
the conditional distribution of Y, if we were to intervene and set the value of 
X. This conditional distribution is a counter-factual, at least when the study 
is observational. The conditional distribution answers the question, what 
would have happened if we had intervened and set X to x, rather than letting 
Nature take its course? The idea is best suited to experiments or hypothetical 
experiments. (The latter are also called “thought experiments” or “gedanken 
experiments.”) The formalism applies less well to non-manipulationist ideas 
of causation: the moon causes the tides, earthquakes cause property values 
to go down, time heals all wounds. Time is not manipulable; neither are 
earthquakes or the moon. 

Investigators may hope that regression equations are like laws of motion 
in classical physics: if position and momentum are given, you can deter- 
mine the future of the system and discover what would happen with different 
initial conditions. Some other formalism may be needed to make this non- 
manipulationist account more precise. Evans (1993) has an interesting survey 
of causal ideas in epidemiology, with many examples. In the legal context, 
the survey to read is Hart and Honoré (1985). 


Levels of measurement. The idea goes back to Yule (1900). Stephens 
(1946) and Lord (1953) are other key references. 


Otis Dudley Duncan was one of the great empirical social scientists of 
the 20th century. Blau and Duncan (1967) were optimistic about the use of 
statistical models in the social sciences, but Duncan’s views darkened after 
20 years of experience— 


“Coupled with downright incompetence in statistics, paradoxically, we 
often find the syndrome that I have come to call statisticism: the notion 
that computing is synonymous with doing research, the naive faith that 
statistics is a complete or sufficient basis for scientific methodology, the 
superstition that statistical formulas exist for evaluating such things as 
the relative merits of different substantive theories or the ‘importance’ of 
the causes of a ‘dependent variable’; and the delusion that decomposing 
the covariations of some arbitrary and haphazardly assembled collec- 
tion of variables can somehow justify not only a ‘causal model’ but also, 
praise the mark, a ‘measurement model.’ There would be no point in 
deploring such caricatures of the scientific enterprise if there were a 
clearly identifiable sector of social science research wherein such falla- 
cies were clearly recognized and emphatically out of bounds.” (Duncan 
1984, p. 226)


7.1 Introduction


Maximum likelihood is a general (and, with large samples, very power- 
ful) method for estimating parameters in a statistical model. The maximum 
likelihood estimator is usually called the MLE. Here, we begin with textbook 
examples like the normal, binomial, and Poisson. Then comes the probit 
model, with a real application—the effects of Catholic schools (Evans and 
Schwab 1995, reprinted at the back of the book). This application will show 
the strengths and weaknesses of the probit model in action. 


Example 1. \(N(\mu, 1)\) with \(-\infty < \mu < \infty\). The density at \(x\) is

\[
\frac{1}{\sqrt{2\pi}} \exp\Big[-\frac{1}{2}(x - \mu)^2\Big], \quad \text{where } \exp(x) = e^x.
\]

See section 3.5. For \(n\) independent \(N(\mu, 1)\) variables \(X_1, \ldots, X_n\), the
density at \(x_1, \ldots, x_n\) is

\[
\Big(\frac{1}{\sqrt{2\pi}}\Big)^n \exp\Big[-\frac{1}{2}\sum_{i=1}^n (x_i - \mu)^2\Big].
\]

The likelihood function is the density evaluated at the data \(X_1, \ldots, X_n\),
viewed as a function of the parameter \(\mu\). The log likelihood function is
more useful:

\[
L_n(\mu) = -\frac{1}{2}\sum_{i=1}^n (X_i - \mu)^2 - n \log\big(\sqrt{2\pi}\big).
\]

The notation makes it explicit that \(L_n(\mu)\) depends on the sample size \(n\) and
the parameter \(\mu\). There is also dependence on the data, because the likelihood
function is evaluated at the \(X_i\): look at the right hand side of the equation.

The MLE is the parameter value \(\hat\mu\) that maximizes \(L_n(\mu)\). To find the
MLE, you can start by differentiating \(L_n(\mu)\) with respect to \(\mu\):

\[
L_n'(\mu) = \sum_{i=1}^n (X_i - \mu).
\]

Set \(L_n'(\mu)\) to 0 and solve. The unique \(\mu\) with \(L_n'(\mu) = 0\) is
\(\hat\mu = \bar X\), the sample mean. Check that

\[
L_n''(\mu) = -n.
\]

Thus, \(\bar X\) is the maximum not the minimum. (Here, \(L_n'\) means the derivative
not the transpose, and \(L_n''\) is the second derivative.)

What is the idea? Let’s take the normal model for granted, and try to
estimate the parameter \(\mu\) from the data. The MLE looks for the value of \(\mu\)
that makes the data as likely as possible, given the model. Technically, that
means looking for the \(\mu\) which maximizes \(L_n(\mu)\).
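
If you want to see example 1 in action, here is a small numerical check. It
is a sketch in Python, not anything from the computer labs; the function names
and the five data values are made up for illustration. The code evaluates
\(L_n(\mu)\) on a grid and confirms that the best grid point sits at the sample
mean, up to the grid spacing.

```python
import math


def normal_loglik(mu, data):
    """Log likelihood L_n(mu) for IID N(mu, 1) data, as in example 1."""
    n = len(data)
    return (-0.5 * sum((x - mu) ** 2 for x in data)
            - n * math.log(math.sqrt(2 * math.pi)))


def grid_argmax(f, lo, hi, steps):
    """Return the point on an evenly spaced grid over [lo, hi] where f is largest."""
    grid = [lo + (hi - lo) * k / steps for k in range(steps + 1)]
    return max(grid, key=f)


if __name__ == "__main__":
    data = [1.2, 0.7, 2.1, 1.6, 0.9]      # hypothetical observations
    xbar = sum(data) / len(data)          # the sample mean
    mu_hat = grid_argmax(lambda m: normal_loglik(m, data), 0.0, 3.0, 3000)
    print(xbar, mu_hat)                   # these agree to within the grid spacing
```

A grid search is crude, but it makes the point: nothing about the search knows
the calculus, and it still lands on \(\bar X\).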


Example 2. Binomial\((1, p)\) with \(0 < p < 1\). Let \(X_i\) be independent.
Each \(X_i\) is 1 with probability \(p\) and 0 with remaining probability \(1 - p\), so
\(X_i\) has the Binomial\((1, p)\) distribution. Let \(x_i = 0\) or 1. The probability
that \(X_i = x_i\) for \(i = 1, \ldots, n\) is

\[
\prod_{i=1}^n p^{x_i}(1 - p)^{1 - x_i}.
\]

The reasoning: due to independence, the probability is the product of \(n\)
factors. If \(x_i = 1\), the \(i\)th factor is \(P_p(X_i = 1) = p = p^{x_i}(1 - p)^{1 - x_i}\),
because \((1 - p)^0 = 1\). If \(x_i = 0\), the factor is \(P_p(X_i = 0) = 1 - p =
p^{x_i}(1 - p)^{1 - x_i}\), because \(p^0 = 1\). (Here, \(P_p\) is the probability that governs
the \(X_i\)’s when the parameter is \(p\).) Let \(S = X_1 + \cdots + X_n\). Check that

\[
\begin{aligned}
L_n(p) &= \sum_{i=1}^n \big[X_i \log p + (1 - X_i) \log(1 - p)\big] \\
&= S \log p + (n - S) \log(1 - p).
\end{aligned}
\]

Now

\[
L_n'(p) = \frac{S}{p} - \frac{n - S}{1 - p}
\]

and

\[
L_n''(p) = -\frac{S}{p^2} - \frac{n - S}{(1 - p)^2}.
\]

The MLE is \(\hat p = S/n\).

If \(S = 0\), the likelihood function is maximized at \(p = 0\). This is an
“endpoint maximum.” Similarly, if \(S = n\), the likelihood function has an
endpoint maximum at \(p = 1\). In the first case, \(L_n' < 0\) on \((0, 1)\). In the
second case, \(L_n' > 0\) on \((0, 1)\). Either way, the equation \(L_n'(p) = 0\) has no
solution.


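Example 3. Poisson\((\lambda)\) with \(0 < \lambda < \infty\). Let \(X_i\) be independent
Poisson\((\lambda)\). If \(j = 0, 1, \ldots\) then

\[
P_\lambda(X_i = j) = e^{-\lambda}\frac{\lambda^j}{j!}
\]

and

\[
P_\lambda(X_i = j_i \text{ for } i = 1, \ldots, n) = e^{-n\lambda}\lambda^{j_1 + \cdots + j_n} \prod_{i=1}^n \frac{1}{j_i!},
\]

where \(P_\lambda\) is the probability distribution that governs the \(X_i\)’s when the
parameter is \(\lambda\). Let \(S = X_1 + \cdots + X_n\). So

\[
L_n(\lambda) = -n\lambda + S \log\lambda - \sum_{i=1}^n \log(X_i!).
\]

Now

\[
L_n'(\lambda) = -n + \frac{S}{\lambda}
\]

and

\[
L_n''(\lambda) = -\frac{S}{\lambda^2}.
\]

The MLE is \(\hat\lambda = S/n\). (This is an endpoint maximum if \(S = 0\).)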


Example 4. Let \(X\) be a positive random variable, with \(P_\theta(X > x) =
\theta/(\theta + x)\) for \(0 < x < \infty\), where the parameter \(\theta\) is a positive real number.
The distribution function of \(X\) is \(x/(\theta + x)\). The density is \(\theta/(\theta + x)^2\). Let
\(X_1, \ldots, X_n\) be independent, with density \(\theta/(\theta + x)^2\). Then

\[
L_n(\theta) = n \log\theta - 2\sum_{i=1}^n \log(\theta + X_i).
\]

Now

\[
L_n'(\theta) = \frac{n}{\theta} - 2\sum_{i=1}^n \frac{1}{\theta + X_i}
\]

and

\[
L_n''(\theta) = -\frac{n}{\theta^2} + 2\sum_{i=1}^n \frac{1}{(\theta + X_i)^2}.
\]

There is no explicit formula for the MLE, but you can find it by numerical 
methods on the computer. (Computer labs 10-12 at the back of the book 
will get you started on numerical maximization, or see the end notes for the 
chapter; a detailed treatment is beyond our scope.) This example is a little 
artificial. It will be used to illustrate some features of the MLE. 
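
To give a feel for the numerical route in example 4, here is a minimal sketch
in Python. It is not the code from the computer labs: the function names, the
choice of bisection, the seed, and the true value \(\theta = 2\) are all
assumptions made for illustration. The sketch leans on exercise 6 below
(\(\theta L_n'(\theta)\) decreases from \(n\) to \(-n\), so \(L_n'\) has a unique zero)
and draws test data as \(\theta U/(1 - U)\) with \(U\) uniform, which has
\(P(X > x) = \theta/(\theta + x)\).

```python
import random


def score(theta, data):
    """L_n'(theta) = n/theta - 2 * sum 1/(theta + X_i), as in example 4."""
    return len(data) / theta - 2.0 * sum(1.0 / (theta + x) for x in data)


def mle_bisect(data, lo=1e-8, hi=1.0, tol=1e-10):
    """Solve L_n'(theta) = 0 by bisection (one of many possible methods)."""
    while score(hi, data) > 0:          # widen the bracket until the sign flips
        hi *= 2.0
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if score(mid, data) > 0:        # still climbing: the root is to the right
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def draw_sample(theta, n, rng):
    """Draw X = theta * U / (1 - U), U uniform, so P(X > x) = theta/(theta + x)."""
    sample = []
    for _ in range(n):
        u = rng.random()
        sample.append(theta * u / (1.0 - u))
    return sample


if __name__ == "__main__":
    rng = random.Random(0)
    data = draw_sample(2.0, 2000, rng)   # hypothetical data, true theta = 2
    print(mle_bisect(data))              # should land near 2
```

Bisection is slow but hard to break. Faster methods such as Newton-Raphson use
\(L_n''\) as well, at some cost in robustness.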


REMARKS. In example 1, the sample mean \(\bar X\) is \(N(\mu, 1/n)\). In example 2,
the sum is Binomial\((n, p)\):

\[
P(S = j) = \binom{n}{j} p^j (1 - p)^{n - j}.
\]

In example 3, the sum is Poisson\((n\lambda)\):

\[
P(S = j) = e^{-n\lambda}\frac{(n\lambda)^j}{j!}.
\]

DEFINITION. There is a statistical model parameterized by \(\theta\). The Fisher
information is \(I_\theta = -E_\theta[L_1''(\theta)]\), namely, the negative of the expected value
of the second derivative of the log likelihood function, for a sample of size 1.


THEOREM 1. Suppose \(X_1, \ldots, X_n\) are IID with probability distribution
governed by the parameter \(\theta\). Let \(\theta_0\) be the true value of \(\theta\). Under regularity
conditions (which are omitted here), the MLE for \(\theta\) is asymptotically normal.
The asymptotic mean of the MLE is \(\theta_0\). The asymptotic variance can be
computed in three ways:

(i) \(I_{\theta_0}^{-1}/n\),
(ii) \(I_{\hat\theta}^{-1}/n\),
(iii) \([-L_n''(\hat\theta)]^{-1}\).

If \(\hat\theta\) is the MLE and \(v_n\) is the asymptotic variance, the theorem says that
\((\hat\theta - \theta_0)/\sqrt{v_n}\) is nearly \(N(0, 1)\) when the sample size \(n\) is large and
we’re sampling from \(\theta_0\). (“Asymptotic” results are nearly right for large
samples.) The \([-L_n''(\hat\theta)]\) in (iii) is often called “observed information.”
With option (iii), the sample size \(n\) is built into \(L_n\): there is no division
by \(n\).
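
The three options can be compared on the Poisson model of example 3, where
everything is explicit: \(I_\lambda = 1/\lambda\) and \(-L_n''(\lambda) = S/\lambda^2\). The sketch
below is in Python and is only an illustration. The function names, the seed,
the true value \(\lambda_0 = 3\), and the sample size are assumptions; the Poisson
generator is a textbook multiplication method, good enough for small
\(\lambda\). The code computes options (i), (ii), (iii) from one sample, then
estimates the variance of the MLE by brute force over repeated samples.

```python
import math
import random


def poisson_draw(lam, rng):
    """One Poisson(lam) draw by multiplying uniforms (fine for small lam)."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def variance_options(data, lam0):
    """Options (i), (ii), (iii) of theorem 1 for the Poisson model.
    Assumes the data are not all zero, so that S > 0."""
    n, s = len(data), sum(data)
    lam_hat = s / n                  # the MLE from example 3
    opt_i = lam0 / n                 # I^{-1}/n at the true value; I = 1/lambda
    opt_ii = lam_hat / n             # I^{-1}/n at the MLE
    opt_iii = lam_hat**2 / s         # 1/[-L_n''(lam_hat)], with -L_n'' = S/lambda^2
    return opt_i, opt_ii, opt_iii


if __name__ == "__main__":
    rng = random.Random(1)
    lam0, n = 3.0, 400               # hypothetical true value and sample size
    sample = [poisson_draw(lam0, rng) for _ in range(n)]
    print(variance_options(sample, lam0))
    # Brute-force check: variance of the MLE across 500 simulated samples
    mles = [sum(poisson_draw(lam0, rng) for _ in range(n)) / n for _ in range(500)]
    m = sum(mles) / len(mles)
    print(sum((x - m) ** 2 for x in mles) / (len(mles) - 1))
```

For the Poisson, options (ii) and (iii) coincide, because \(\hat\lambda^2/S =
\hat\lambda/n\). That is special to this model; in general the three options give
different numbers that agree only approximately in large samples.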

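The MLE can be used in multi-dimensional problems, and theorem 1
generalizes. When the parameter vector \(\theta\) is \(p\) dimensional, \(L'(\theta)\) is a \(p\)
vector. The \(j\)th component of \(L'(\theta)\) is \(\partial L/\partial\theta_j\). Furthermore,
\(L''(\theta)\) is a \(p \times p\) matrix. The \(ij\)th component of \(L''(\theta)\) is

\[
\frac{\partial^2 L}{\partial\theta_i\,\partial\theta_j} = \frac{\partial^2 L}{\partial\theta_j\,\partial\theta_i}.
\]

We’re assuming that \(L\) is smooth. Then the matrix \(L''\) is symmetric. We still
define \(I_\theta = -E_\theta[L_1''(\theta)]\). This is now a \(p \times p\) matrix. The diagonal
elements of \(I_{\theta_0}^{-1}/n\) give asymptotic variances for the components of
\(\hat\theta\); the off-diagonal elements, the covariances. Similar comments apply to
\([-L_n''(\hat\theta)]^{-1}\).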

What about independent variables that are not identically distributed?
Theorem 1 can be extended to cover this case, although options (i) and (ii)
for asymptotic variance get a little more complicated. For instance, option (i)
becomes \(\{-E_{\theta_0}[L_n''(\theta_0)]\}^{-1}\). Observed information is still a good option,
even if the likelihood function is harder to compute.

The examples. The normal, binomial, and Poisson are “exponential 
families” where the theory is especially attractive (although it is beyond our 
scope). Among other things, the likelihood function generally has a unique 
maximum. With other kinds of models, there are usually several local maxima 
and minima. 

Caution. Ordinarily, the MLE is biased—although the bias is small with 
large samples. The asymptotic variance is also an approximation. Moreover, 
with small samples, the distribution of the MLE is often far from normal. 


Exercise set A 


1. In example 1, the log likelihood function is a sum—as it is in examples 
2, 3, and 4. Is this a coincidence? If not, what is the principle? 

2. (a) Suppose \(X_1, X_2, \ldots, X_n\) are IID \(N(\mu, 1)\). Find the mean and
variance of the MLE for \(\mu\). Find the distribution of the MLE and compare
to the theorem. Show that \(-L_n''(\hat\mu)/n \to I_\mu\). Comment: for
the normal, the asymptotics are awfully good.
   (b) If \(U\) is \(N(0, 1)\), show that \(U\) is symmetric: namely, \(P(U < y) =
P(-U < y)\). Hints. (i) \(P(-U < y) = P(U > -y)\), and
(ii) \(\exp(-x^2/2)\) is a symmetric function of \(x\).


3. Repeat 2(a) for the binomial in example 2. Is the MLE normally
distributed? Or is it only approximately normal?

4. Repeat 2(a) for the Poisson in example 3. Is the MLE normally
distributed? Or is it only approximately normal?

5. Find the density of \(\theta U/(1 - U)\), where \(U\) is uniform on [0,1] and
\(\theta > 0\).

6. Suppose the \(X_i > 0\) are independent, and their common density is
\(\theta/(\theta + x)^2\) for \(i = 1, \ldots, n\), as in example 4. Show that
\(\theta L_n'(\theta) = -n + 2\sum_{i=1}^n X_i/(\theta + X_i)\). Deduce that
\(\theta \mapsto \theta L_n'(\theta)\) decreases from \(n\) to \(-n\) as \(\theta\) increases from 0
to \(\infty\). Conclude that \(L_n\) has a unique maximum. (Reminder: \(L_n'\) means
the derivative not the transpose.)

7. What is the median of \(X\) in example 4?

8. Show that the Fisher information in example 4 is \(1/(3\theta^2)\).

9. Suppose \(X_i\) are independent for \(i = 1, \ldots, n\), with a common Poisson
distribution. Suppose \(E(X_i) = \lambda > 0\), but the parameter of interest is
\(\theta = \lambda^2\). Find the MLE for \(\theta\). Is the MLE biased or unbiased?

10. As in exercise 9, but the parameter of interest is \(\theta = \sqrt{\lambda}\). Find
the MLE for \(\theta\). Is the MLE biased or unbiased?

11. Let \(\beta\) be a positive real number, which is unknown. Suppose \(X_i\) are
independent Poisson random variables, with \(E(X_i) = \beta i\) for
\(i = 1, 2, \ldots, 20\). How would you estimate \(\beta\)?

12. Suppose \(X, Y, Z\) are independent normal random variables, each having
variance 1. The means are \(\alpha + \beta\), \(\alpha + 2\beta\), \(2\alpha + \beta\), respectively:
\(\alpha, \beta\) are parameters to be estimated. Show that maximum likelihood and
OLS give the same estimates. Note: this won’t usually be true; the result
depends on the normality assumption.

13. Let \(\theta\) be a positive real number, which is unknown. Suppose the \(X_i\)
are independent for \(i = 1, \ldots, n\), with a common distribution \(P_\theta\) that
depends on \(\theta\): \(P_\theta\{X_i = j\} = c(\theta)(\theta + j)^{-1}(\theta + j + 1)^{-1}\) for
\(j = 0, 1, 2, \ldots\). What is \(c(\theta)\)? How would you estimate \(\theta\)? Hints on
finding \(c(\theta)\). What is \(\sum_{j=0}^{\infty}(a_j - a_{j+1})\)? What is
\((\theta + j)^{-1} - (\theta + j + 1)^{-1}\)?

14. Suppose \(X_i\) are independent for \(i = 1, \ldots, n\), with common density
\(\frac{1}{2}\exp(-|x - \theta|)\), where \(\theta\) is a parameter, \(x\) is real, and \(n\) is odd.
Show that the MLE for \(\theta\) is the sample median. Hint: see exercise 2B18.


7.2 Probit models


The probit model explains a 0-1 response variable \(Y_i\) for subject \(i\) in
terms of a row vector of covariates \(X_i\). Let \(X\) be the matrix whose \(i\)th row
is \(X_i\). Each row in \(X\) represents the covariates for one subject, and each
column represents one covariate. Given \(X\), the responses \(Y_i\) are assumed to
be independent random variables, taking values 0 or 1, with

\[
P(Y_i = 0 \mid X) = 1 - \Phi(X_i\beta), \qquad P(Y_i = 1 \mid X) = \Phi(X_i\beta).
\]

Here, \(\Phi\) is the standard normal distribution function and \(\beta\) is a parameter
vector. Any distribution function could be used: \(\Phi\) is what makes it a probit
model rather than a logit model or an xxxit model.

Let’s try some examples. About one-third of Americans age 25+ read a
book last year. Strange but true. Probabilities vary with education, income,
and gender, among other things. In a (hypothetical) study on this issue,
subjects are indexed by \(i = 1, \ldots, n\). The response variable \(Y_i\) is defined as
1 if subject \(i\) read a book last year, else \(Y_i = 0\). The vector of explanatory
variables for subject \(i\) is \(X_i = [1, \mathrm{ED}_i, \mathrm{INC}_i, \mathrm{MAN}_i]\):

\(\mathrm{ED}_i\) is years of schooling completed by subject \(i\).

\(\mathrm{INC}_i\) is the annual income of subject \(i\), in US dollars.

\(\mathrm{MAN}_i\) is 1 if subject \(i\) is a man, else 0. (This is a dummy variable:
section 6.6.)

The parameter vector \(\beta\) is \(4 \times 1\). Given the covariate matrix \(X\), the \(Y_i\)’s
are assumed to be independent with \(P(Y_i = 1) = \Phi(X_i\beta)\), where \(\Phi\) is the
standard normal distribution function.

This is a lot like coin-tossing (example 2), but there is one major
difference. Each subject \(i\) has a different probability of reading a book. The
probabilities are all computed using the same formula, \(\Phi(X_i\beta)\). The
parameter vector \(\beta\) is the same for all the subjects. That is what ties the
different subjects together. Different subjects have different probabilities
only because of their covariates. Let’s do some special cases to clarify this.


Example 5. Suppose we know that \(\beta_1 = -0.35\), \(\beta_2 = 0.02\),
\(\beta_3 = 1/100{,}000\), and \(\beta_4 = -0.1\). A man has 12 years of education and
makes $40,000 a year. His \(X_i\beta\) is

\[
-0.35 + 12 \times 0.02 + 40{,}000 \times \frac{1}{100{,}000} - 0.1
= -0.35 + 0.24 + 0.4 - 0.1 = 0.19.
\]

The probability he read a book last year is \(\Phi(0.19) = 0.58\).

A similarly situated woman has \(X_i\beta = -0.35 + 0.24 + 0.4 = 0.29\). The
probability she read a book last year is \(\Phi(0.29) = 0.61\), a bit higher than the
0.58 for her male counterpart in example 5. The point of the dummy variable
is to add \(\beta_4\) to \(X_i\beta\) for male subjects but not females. Here, \(\beta_4\) is negative.
(Adding a negative number is what most people would call subtraction.)
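
The arithmetic in example 5 is easy to reproduce on the computer. The sketch
below is in Python; the function names are invented for illustration, and
\(\Phi\) is computed from the error function in the standard library. The
coefficients and covariates are the ones given in the example.

```python
import math


def std_normal_cdf(z):
    """Phi(z), the standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def probit_probability(beta, x):
    """Return (X beta, Phi(X beta)) for one subject with covariate row x."""
    index = sum(b * v for b, v in zip(beta, x))
    return index, std_normal_cdf(index)


if __name__ == "__main__":
    beta = [-0.35, 0.02, 1 / 100_000, -0.1]   # the values given in example 5
    man = [1, 12, 40_000, 1]                  # [1, ED, INC, MAN]
    woman = [1, 12, 40_000, 0]
    for label, x in (("man", man), ("woman", woman)):
        index, prob = probit_probability(beta, x)
        print(label, round(index, 2), round(prob, 2))
    # prints: man 0.19 0.58
    #         woman 0.29 0.61
```

The only difference between the two rows is the last entry, the dummy
variable, which switches \(\beta_4\) on or off.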


Estimation. We turn to the case where \(\beta\) is unknown, to be estimated
from the data by maximum likelihood. The probit model makes the independence
assumption, so the likelihood function is a product with a factor for
each subject. Let’s compute this factor for two subjects.


Example 6. Subject is male, with 18 years of education and a salary of
$60,000. Not a reader, he watches TV or goes to the opera for relaxation.
His factor in the likelihood function is

\[
1 - \Phi(\beta_1 + 18\beta_2 + 60{,}000\beta_3 + \beta_4).
\]

It’s \(1 - \Phi\) because he doesn’t read. There’s \(+\beta_4\) in the equation, because
it’s him not her. TV and opera are irrelevant.


Example 7. Subject is female, with 16 years of education and a salary
of $45,000. She reads books, has red hair, and loves scuba diving. Her factor
in the likelihood function is

\[
\Phi(\beta_1 + 16\beta_2 + 45{,}000\beta_3).
\]

It’s \(\Phi\) because she reads books. There is no \(\beta_4\) in the equation: her dummy
variable is 0. Hair color and underwater activities are irrelevant.


Since the likelihood is a product (we’ve conditioned on \(X\)), the log
likelihood is a sum, with a term for each subject:

\[
\begin{aligned}
L_n(\beta) &= \sum_{i=1}^n \Big( Y_i \log\big[P(Y_i = 1 \mid X)\big] + (1 - Y_i) \log\big[1 - P(Y_i = 1 \mid X)\big] \Big) \\
&= \sum_{i=1}^n \Big( Y_i \log\big[\Phi(X_i\beta)\big] + (1 - Y_i) \log\big[1 - \Phi(X_i\beta)\big] \Big).
\end{aligned}
\]

Readers contribute terms with \(\log[\Phi(X_i\beta)]\): the \(\log[1 - \Phi(X_i\beta)]\) drops
out, because \(Y_i = 1\) if subject \(i\) is a reader. It’s the reverse for non-readers:
\(Y_i = 0\), so \(\log[\Phi(X_i\beta)]\) drops out and \(\log[1 - \Phi(X_i\beta)]\) stays in. If this
isn’t clear, review the binomial example in section 1.

Given \(X\), the \(Y_i\) are independent. They are not identically distributed:
\(P(Y_i = 1 \mid X) = \Phi(X_i\beta)\) differs from one \(i\) to another. As noted earlier,
Theorem 1 can be extended to cover this case, although options (i)
and (ii) for asymptotic variance have to be revised: e.g., option (i) becomes
\(\{-E_{\theta_0}[L_n''(\theta_0)]\}^{-1}\). We estimate \(\beta\) by maximizing \(L_n(\beta)\). As in most
applications, this would be impossible by calculus, so it’s done numerically. The
asymptotic covariance matrix is \([-L_n''(\hat\beta)]^{-1}\). Observed information is used
because it isn’t feasible to compute the Fisher information matrix analytically.
To get standard errors, take square roots of the diagonal elements.
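
Here is one way the numerical work might go, in a stripped-down setting. The
sketch is in Python and makes several assumptions that are not in the text:
there is an intercept and a single covariate, so \(X_i = [1, x_i]\); the search
is Newton-Raphson started from \(\beta = 0\); the data are simulated with made-up
coefficients. It uses the identity \(1 - \Phi(z) = \Phi(-z)\), so each subject’s
factor is \(\Phi(q_i X_i\beta)\) with \(q_i = 2Y_i - 1\). Standard errors come from
the observed information, as described above.

```python
import math
import random


def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def Phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def loglik_parts(beta, xs, ys):
    """Log likelihood, score, and observed information -L_n'' (as h00, h01, h11)
    for a probit model with rows X_i = [1, x_i]."""
    ll = g0 = g1 = h00 = h01 = h11 = 0.0
    for x, y in zip(xs, ys):
        q = 2 * y - 1                        # +1 if Y_i = 1, -1 if Y_i = 0
        eta = beta[0] + beta[1] * x
        p = max(Phi(q * eta), 1e-300)        # this subject's factor; avoid log(0)
        lam = q * phi(eta) / p               # d log p / d eta
        w = lam * (lam + eta)                # -d^2 log p / d eta^2
        ll += math.log(p)
        g0 += lam
        g1 += lam * x
        h00 += w
        h01 += w * x
        h11 += w * x * x
    return ll, (g0, g1), (h00, h01, h11)


def fit_probit(xs, ys, iters=25):
    """Newton-Raphson from beta = 0; returns the MLE and its standard errors."""
    b0 = b1 = 0.0
    for _ in range(iters):
        _, (g0, g1), (h00, h01, h11) = loglik_parts((b0, b1), xs, ys)
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det    # beta + [observed info]^{-1} score
        b1 += (h00 * g1 - h01 * g0) / det
    _, _, (h00, h01, h11) = loglik_parts((b0, b1), xs, ys)
    det = h00 * h11 - h01 * h01
    # Diagonal of the inverse of the 2 x 2 observed information matrix
    return (b0, b1), (math.sqrt(h11 / det), math.sqrt(h00 / det))


def simulate(n, beta, rng):
    """Hypothetical data: x ~ N(0, 1), Y = 1 with probability Phi(b0 + b1 x)."""
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [1 if rng.random() < Phi(beta[0] + beta[1] * x) else 0 for x in xs]
    return xs, ys


if __name__ == "__main__":
    xs, ys = simulate(2000, (-0.5, 1.0), random.Random(7))
    print(fit_probit(xs, ys))    # estimates should be near (-0.5, 1.0)
```

Plain Newton steps can misbehave when the data are nearly separated by the
covariate; production software adds safeguards. With more covariates the
2 x 2 algebra is replaced by matrix inversion, but the logic is the same.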


Why not regression? 


You probably don’t want to tell the world that Y = Xβ + ε. The reason:
X_iβ is going to produce numbers other than 0 or 1, and X_iβ + ε_i is even
worse. The next option might be P(Y_i = 1|X) = X_iβ, the Y_i being assumed
conditionally independent across subjects. That’s a “linear probability
model.” Chapter 9 has an example with additional complications.

Given data from a linear probability model, you can estimate β by feasible
GLS. However, there are likely to be some subjects with X_iβ̂ > 1, and
other subjects with X_iβ̂ < 0. A probability of 1.5 is a jolt; so is −0.3. The
probit model respects the constraint that probabilities are between 0 and 1. 
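
A tiny made-up data set shows the jolt. The sketch below fits a straight line to a 0-1 response by ordinary least squares; OLS stands in for feasible GLS only to keep the example short, and the ten data points are hypothetical.

```python
def ols_line(xs, ys):
    """Least-squares intercept and slope for one covariate."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

# Hypothetical data: the response is 0 or 1, the covariate runs from 0 to 9.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

intercept, slope = ols_line(xs, ys)
fitted = [intercept + slope * x for x in xs]
# The fitted "probabilities" at the two ends fall outside the interval [0, 1].
print(min(fitted), max(fitted))
```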

Regression isn’t useless in the probit context. To maximize the likelihood 
function by numerical methods, it helps to have a reasonable starting point. 
Regress Y on X, and start the search from there. 


The latent-variable formulation 


The probit model is one analog of regression for binary response vari- 
ables; the logit model, discussed below, is another. So far, there is no error 
term in the picture. However, the model can be set up with something like an 
error term. To see how, let’s go back to the probit model for reading books. 

Subject i has a latent (hidden) variable U_i. These are IID N(0, 1) across
subjects, independent of the covariates. (Reminder: IID = independent and
identically distributed.) Subject i reads books if X_iβ + U_i > 0. However, if
X_iβ + U_i < 0, then subject i is not a reader. We don’t have to worry about
the possibility that X_iβ + U_i = 0: this is an event with probability 0.

Given the covariate matrix X, the probability that subject i reads books 
is 

\[ P(X_i\beta + U_i > 0) = P(U_i > -X_i\beta) = P(-U_i < X_i\beta). \]

Because U_i is symmetric (exercise A2),

\[ P(-U_i < X_i\beta) = P(U_i < X_i\beta) = \Phi(X_i\beta). \]

So P(X_iβ + U_i > 0) = Φ(X_iβ). The new formulation with latent variables
gives the right probabilities.
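
A quick simulation can confirm the algebra. In this sketch the value 0.42 for X_iβ is an arbitrary choice, and the function name is ours; the simulated frequency of X_iβ + U_i > 0 should land close to Φ(X_iβ).

```python
import random
from statistics import NormalDist

def simulated_reading_rate(index, n_draws=200_000, seed=1):
    """Fraction of draws with index + U > 0, where U is N(0, 1)."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_draws) if index + rng.gauss(0.0, 1.0) > 0)
    return hits / n_draws

index = 0.42                      # hypothetical value of X_i * beta
rate = simulated_reading_rate(index)
exact = NormalDist().cdf(index)   # Phi(X_i * beta)
print(rate, exact)
```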

The probit model now has something like an error term, namely, the
latent variable. But there is an important difference between latent variables
and error terms. You can’t estimate latent variables. At most, the data tell
you X_i and the sign of X_iβ + U_i. That is not enough to determine U_i. By
contrast, error terms in a regression model can be approximated by residuals.

The latent-variable formulation does make the assumptions clearer. The
probit model requires the U_i’s to be independent of the X_i’s, and IID across
subjects. The U_i’s need to be normal. The response for subject i depends
only on that subject’s covariates. (Look at the formulas!)

The hard questions about probit models are usually ducked. Is IID
realistic for reading books? Not if there’s word-of-mouth: “Hey, you have to
read this book, it’s great.” Why are the β’s the same for everybody? e.g., for
men and women? Why is the effect of income the same for all educational
levels? What about other variables?


If the assumptions in the model break down, the MLE will be 
biased—even with large samples. The bias may be severe. Also, 
estimated standard errors will not be reliable. 


Exercise set B 


1. Let Z be N(0, 1) with density function φ and distribution function Φ
(section 3.5). True or false, and explain:


(a) The slope of Φ at x is φ(x).

(b) The area to the left of x under φ is Φ(x).

(c) P(Z = x) = φ(x).

(d) P(Z < x) = Φ(x).

(e) P(Z ≤ x) = Φ(x).

(f) P(x < Z < x + h) ≈ φ(x)h if h is small and positive.


2. In brief, the probit model for reading says that subject i read a book last
year if X_iβ + U_i > 0.


(a) What are X_i and β?


(b) The U_i is a ______ variable. Options (more than one may be
right):


data random latent dummy observable


(c) What are the assumptions on U_i?


(d) The log likelihood function is a ______, with one ______
for each ______. Fill in the blanks using the options below, and
explain briefly.


sum product quotient matrix term 
subject factor entry book variable 


3. As in example 5, suppose we know β_1 = −0.35, β_2 = 0.02, β_3 =
1/100,000, β_4 = −0.1. George has 12 years of education and makes
$40,000 a year. His brother Harry also has 12 years of education but 
makes $50,000 a year. True or false, and explain: according to the 
model, the probability that Harry read a book last year is 0.1 more than 
George’s probability. If false, compute the difference in probabilities. 


Identification vs estimation 


Two very technical ideas are coming up: identifiability and estimability.
Take identifiability first. Suppose P_θ is the probability distribution that
governs X. The distribution depends on the parameter θ. Think of X as
observable, so P_θ is something we can determine. The function f(θ) is
identifiable if f(θ_1) ≠ f(θ_2) implies P_{θ_1} ≠ P_{θ_2} for every pair
(θ_1, θ_2) of parameter values. In other words, f(θ) is identifiable if
changing f(θ) changes the distribution of an observable random variable.

Now for the second idea: the function f(θ) is estimable if there is a
function g with E_θ[g(X)] = f(θ) for all values of θ, where E_θ stands for
expected value computed from P_θ. This is a cold mathematical definition:
f(θ) is estimable if there is an unbiased estimator for it. Nearly unbiased
won’t do, and variance doesn’t matter.


PROPOSITION 1. If f(θ) is estimable, then f(θ) is identifiable.


Proof. If f(θ) is estimable, there is a function g with E_θ[g(X)] = f(θ)
for all θ. If f(θ_1) ≠ f(θ_2), then E_{θ_1}[g(X)] ≠ E_{θ_2}[g(X)]. So
P_{θ_1} ≠ P_{θ_2}: i.e., θ_1 and θ_2 generate different distributions for X.


The converse to proposition 1 is false. A parameter—or a function of 
a parameter—can be identifiable without being estimable. That is what the 
next example shows. 


Example 8. Suppose 0 < p < 1 is a parameter; X is a binomial
random variable with P_p(X = 1) = p and P_p(X = 0) = 1 − p. Then √p is
identifiable but not estimable. To prove identifiability, √p_1 ≠ √p_2 implies
p_1 ≠ p_2. Then P_{p_1}(X = 1) ≠ P_{p_2}(X = 1). What about estimating √p, for
instance, by g(X)—where g is some suitable function? Well, E_p[g(X)] =
(1 − p)g(0) + pg(1). This is a linear function of p. But √p isn’t linear. So
√p isn’t estimable: there is no g with E_p[g(X)] = √p for all p. In short,
√p is identifiable but not estimable, as advertised.


For the binomial, the parameter is one-dimensional. However, the defi- 
nitions apply also to multi-dimensional parameters. Identifiability is an im- 
portant concept, but it may seem a little mysterious. Let’s say it differently. 


Something is identifiable if you can get it from the joint distribution 
of observable random variables. 


Example 9. There are three parameters, a, b, and σ². Suppose Y_i =
a + bx_i + δ_i for i = 1, 2, ..., 100. The x_i are fixed and known; in fact, all the
x_i happen to be 2. The unobservable δ_i are IID N(0, σ²). Is a identifiable?
estimable? How about b? a + 2b? σ²? To begin with, the Y_i are IID
N(a + 2b, σ²). The sample mean of the Y_i’s estimates a + 2b. Thus, a + 2b
is estimable and identifiable. The sample variance of the Y_i’s estimates σ²—if
you divide by 99 rather than 100. Thus, σ² is estimable and identifiable.

However, a and b are not separately identifiable. For instance, if a = 0
and b = 1, the Y_i would be IID N(2, σ²). If a = 1 and b = 0.5, the Y_i
would be IID N(2, σ²). If a = √17 and b = (2 − √17)/2, the Y_i would be
IID N(2, σ²). Infinitely many combinations of a and b generate exactly the
same joint distribution for the Y_i. That is why information about the Y_i can’t
help you break a + 2b apart, into a and b. If you want to identify a and b
separately, you need some variation in the x_i.
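
The three (a, b) pairs in example 9 can be checked directly: with every x_i equal to 2, each pair gives the same mean a + 2b for the Y_i, hence the same N(2, σ²) distribution. A minimal sketch (variable names are ours):

```python
import math

x = 2.0  # every x_i equals 2 in example 9

# The three (a, b) pairs mentioned in the example.
pairs = [
    (0.0, 1.0),
    (1.0, 0.5),
    (math.sqrt(17), (2 - math.sqrt(17)) / 2),
]

# The Y_i are N(a + b * x, sigma^2); only the mean depends on (a, b).
means = [a + b * x for a, b in pairs]
print(means)
```

Since the distribution of the data depends on (a, b) only through a + 2b, no amount of data on the Y_i can tell the pairs apart.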


Example 10. Suppose U and V are independent random variables: U 
is N(a, 1) and V is N(b, 1), where a and b are parameters. Although the 
sum U + V is observable, U and V themselves are not observable. Is a + b 
identifiable? How about a? b? To begin with, E(U + V) = a +b. So 
a + b is estimable, hence, identifiable. On the other hand, if we increase a by
some amount and decrease b by the same amount, a + b is unchanged. The 
distribution of U + V is also unchanged. Hence, a and b themselves are not 
identifiable. 


What if the U_i are N(μ, σ²)?


Let’s go back to the probit model for reading books, and try N(μ, 1)
latent variables. Then β_1—the intercept—is mixed up with μ. You can
identify β_1 + μ, but can’t get the pieces β_1, μ. What about N(0, σ²) for
the latents? Without some constraint, parameters are not identifiable. For
instance, the combination σ = 1 and β = γ produces the same probability
distribution for the Y_i given the X_i as σ = 2 and β = 2γ. Setting σ =
1 makes the other parameters identifiable. There would be trouble if the
distribution of the latent variables changed from one subject to another.
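
The scaling argument is easy to verify numerically. If U_i is N(0, σ²), then P(X_iβ + U_i > 0) = Φ(X_iβ/σ), so doubling both σ and β leaves every probability unchanged. In the sketch below the values of X_iγ are hypothetical and the function name is ours.

```python
from statistics import NormalDist

def prob_reader(index, sigma):
    """P(index + U > 0) for U ~ N(0, sigma^2); by symmetry this is P(U < index)."""
    return NormalDist(0.0, sigma).cdf(index)

# Hypothetical values of X_i * gamma for a few subjects.
indices = [-1.2, -0.3, 0.0, 0.42, 1.7]

probs_sigma1 = [prob_reader(xg, sigma=1.0) for xg in indices]      # sigma = 1, beta = gamma
probs_sigma2 = [prob_reader(2 * xg, sigma=2.0) for xg in indices]  # sigma = 2, beta = 2 * gamma
print(probs_sigma1)
print(probs_sigma2)
```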


Exercise set C 


1. If X is N(μ, σ²), show that μ is estimable and σ² is identifiable.


2. Suppose X_1, X_2, and X_3 are independent normal random variables.
Each has variance 1. The means are α, α + 9β, and α + 99β, respectively.
Are α and β identifiable? estimable?


3. Suppose Y = Xβ + ε, where X is a fixed n × p matrix, β is a p × 1
parameter vector, the ε_i are IID with mean 0 and variance σ². Is β
identifiable if the rank of X is p? if the rank of X is p − 1?


4. Suppose Y = Xβ + ε, where X is a fixed n × p matrix of rank p, and
β is a p × 1 parameter vector. The ε_i are independent with common
variance σ² and E(ε_i) = μ_i, where μ is an n × 1 parameter vector. Is β
identifiable?


5. Suppose X_1 and X_2 are IID, with P_p(X_1 = 1) = p and P_p(X_1 = 0) =
1 − p; the parameter p is between 0 and 1. Is p identifiable? estimable?


6. Suppose U and V are independent; U is N(0, σ²) and V is N(0, τ²),
where σ² and τ² are parameters. However, U and V are not observable.
Only U + V is observable. Is σ² + τ² identifiable? How about σ²? τ²?


7. If X is distributed like the absolute value of an N(μ, 1) variable, show
that:


(a) |μ| is identifiable. Hint: what is E(X²)?


(b) μ itself is not identifiable. Hint: μ and −μ lead to the same
distribution for X.


8. For incredibly many bonus points: suppose X is N(μ, σ²). Is |μ|
estimable? What about σ²? Comments. We only have one observation
X, not many observations. A rigorous solution to this exercise might
involve the dominated convergence theorem, or the uniqueness theorem
for Laplace transforms.

7.3 Logit models


Logits are often used instead of probits. The specification is the same,
except that the logistic distribution function Λ is used instead of the normal Φ:

\[ \Lambda(x) = \frac{e^x}{1 + e^x} \quad \text{for } -\infty < x < \infty. \]

The odds ratio is p/(1 − p). People write logit for the log odds ratio:

\[ \operatorname{logit} p = \log \frac{p}{1 - p} \quad \text{for } 0 < p < 1. \]

The logit model says that the response variables Y_i are independent given the
covariates X, and P(Y_i = 1|X) = Λ(X_iβ), that is,

\[ \operatorname{logit} P(Y_i = 1 \mid X) = X_i\beta. \]

(See exercise 6 below.) From the latent-variables perspective,

\[ Y_i = 1 \text{ if } X_i\beta + U_i > 0, \quad \text{but } Y_i = 0 \text{ if } X_i\beta + U_i < 0. \]

The latent variables U_i are independent of the covariate matrix X, and the
U_i are IID, but now the common distribution function of the U_i is Λ. The
logit model uses Λ where the probit uses Φ. That’s the difference. “Logistic
regression” is a synonym for logit models.
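
For readers who like to see the two functions side by side, here is a minimal sketch of Λ and logit in Python. The function names are ours, and the branch on the sign of x is only there to avoid overflow in `math.exp` for large |x|.

```python
import math

def logistic(x):
    """Lambda(x) = e^x / (1 + e^x), arranged to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    ex = math.exp(x)
    return ex / (1.0 + ex)

def logit(p):
    """Log odds, log(p / (1 - p)), for 0 < p < 1."""
    return math.log(p / (1.0 - p))

index = 0.7                # hypothetical value of X_i * beta
p = logistic(index)        # P(Y_i = 1 | X) under the logit model
print(p, logit(p))         # logit undoes Lambda, up to rounding error
```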


Exercise set D 


1. Suppose the random variable X has a continuous, strictly increasing
distribution function F. Show that F(X) is uniform on [0,1]. Hints. Show
that F has a continuous, strictly increasing inverse F^{−1}. So F(X) < y
if and only if X < F^{−1}(y).

2. Conversely, if U is uniform on [0,1], show that F^{−1}(U) has distribution
function F. (This idea is often used to simulate IID picks from F.)


On the logit model

3. Check that the logistic distribution function Λ is monotone increasing.
Hint: if 1 − Λ is decreasing, you’re there.

4. Check that Λ(−∞) = 0 and Λ(∞) = 1.

5. Check that the logistic distribution is symmetric, i.e., 1 − Λ(x) =
Λ(−x). Appearances can be deceiving... .

6. (a) If P(Y_i = 1|X) = Λ(X_iβ), show that logit P(Y_i = 1|X) = X_iβ.
(b) If logit P(Y_i = 1|X) = X_iβ, show that P(Y_i = 1|X) = Λ(X_iβ).

7. What is the distribution of log U − log(1 − U), where U is uniform
on [0, 1]? Hints. Show that log u − log(1 − u) is a strictly increasing
function of u. Then compute the chance that log U − log(1 − U) > x.

8. For θ > 0, suppose X has the density θ/(θ + x)² on the positive half-line
(0, ∞). Show that log(X/θ) has the logistic distribution.

9. Show that g(x) = −log(1 + e^x) is strictly concave on (−∞, ∞). Hint:
check that g''(x) = −e^x/(1 + e^x)² < 0.

10. Suppose that, conditional on the covariates X, the Y’s are independent
0-1 variables, with logit P(Y_i = 1|X) = X_iβ, i.e., the logit model
holds. Show that the log likelihood function can be written as

\[ L_n(\beta) = -\sum_{i=1}^{n} \log\bigl[1 + \exp(X_i\beta)\bigr] + \Bigl( \sum_{i=1}^{n} X_i Y_i \Bigr) \beta. \]

11. (This continues exercises 9 and 10: hard.) Show that L_n(β) is a concave
function of β, and strictly concave if X has full rank. Hints. Let the
parameter vector β be p × 1. Let c be a p × 1 vector with ||c|| >
0. You need to show c'L_n''(β)c ≤ 0, with strict inequality if X has
full rank. Let X_i be the ith row of X, a 1 × p vector. Confirm that
L_n''(β) = Σ_i X_i'X_i g''(X_iβ), where g was defined in exercise 9. Check
that c'X_i'X_i c ≥ 0 and g''(X_iβ) ≤ m < 0 for all i = 1,...,n, where m
is a real number that depends on β.


On the probit model 


12. Let Φ be the standard normal distribution function (mean 0, variance 1).
Let φ = Φ′ be the density. Show that φ′(x) = −xφ(x). If x > 0, show
that

\[ \int_x^\infty z\,\phi(z)\,dz = \phi(x) \quad\text{and}\quad 1 - \Phi(x) < \int_x^\infty \frac{z}{x}\,\phi(z)\,dz. \]

Conclude that 1 − Φ(x) < φ(x)/x for x > 0. If x < 0, show that
Φ(x) < φ(x)/|x|. Show that log Φ and log(1 − Φ) are strictly concave,
because their second derivatives are strictly negative. Hint: do the cases
x > 0 and x < 0 separately.

13. (This continues exercise 12: hard.) Show that the log likelihood for the
probit model is concave, and strictly concave if X has full rank. Hint:
this is like exercise 11.

7.4 The effect of Catholic schools


Catholic schools in the United States seem to be more effective than 
public schools. Graduation rates are higher and more of the students get into 
college. But maybe this is because of student characteristics. For instance, 
richer students might be more likely to go to Catholic schools, and richer kids 
tend to do better academically. That could explain the apparent effect. Evans 
and Schwab (1995) use a probit model to adjust for student characteristics 
like family income. They use a two-equation model to adjust for selection 
effects based on unmeasured characteristics, like intelligence and motivation. 
For example, Catholic schools might look better because they screen out less- 
intelligent, less-motivated students; or, students who are more intelligent and 
better motivated might self-select into Catholic schools. (The paper, reprinted 
at the back of the book, rejects these alternative explanations.) 

Data are from the “High School and Beyond” survey of high schools. 
Evans and Schwab look at students who were sophomores in the original 1980 
survey and who responded to followup surveys in 1982 and 1984. Students 
who dropped out are excluded. So are a further 389 students who attended 
private non-Catholic schools, or whose graduation status was unknown. That 
leaves 13,294 students in the sample. Table 1 in the paper summarizes the 
data: 97% of the students in Catholic schools graduated, compared to 79% 
in public schools—an impressive difference. 

Table 1 also demonstrates potential confounding. For instance, 79% of 
the students in Catholic schools were Catholic, compared to 29% in pub- 
lic schools—not a huge surprise. Furthermore, 14% had family incomes 
above $38,000, compared to 7% in public schools. (These are 1980s dol- 
lars; $38,000 then was equivalent to maybe $80,000 at the beginning of the 
21st century.) A final example: 2% of the students in Catholic schools were 
age 19 and over, compared to 8% in public schools. Generally, however, 
confounding is not prominent. Table 2 has additional detail on outcomes by 
school type. The probit results are in table 3. The bottom line: confounding 
by measured variables does not seem to explain the different success rates for 
Catholic schools and public schools. The imbalance in religious affiliation 
will be taken up separately, below. 

To define the model behind table 3 in Evans and Schwab, let the response
variable Y_i be 1 if student i graduates, otherwise Y_i is 0. Given the covariates,
the model says that graduation is independent across students. For student i,

\[ P(Y_i = 1 \mid C, X) = \Phi(C_i\alpha + X_i\beta), \tag{1} \]

where C_i = 1 if student i attends Catholic school, while C_i = 0 if student i
attends public school. Next, X_i is a vector of dummy variables describing
personal characteristics of student i—gender, race, ethnicity, family income... .
The matrix X on the left hand side of (1) has a row for each student and a
column for each variable: X_i is the ith row of X. Similarly, C is the vector
whose ith component is C_i. As usual, Φ is the standard normal distribution
function. The parameters α and β are estimated by maximum likelihood: α
is a scalar, and β is a vector. (We’re not using the same notation as the paper.)

For Evans and Schwab, the interesting parameter in (1) is α, which
measures the effect of the Catholic schools relative to the public schools,
all else equal—gender, race, etc. (It is the assumptions behind the model
that do the equalizing; “all else equal” is not a phrase to be treated lightly.)
The Catholic-school effect on graduation is positive and highly significant:
α̂ = 0.777, with an SE of 0.056, so t = 0.777/0.056 ≈ 14. (See table 3;
a t-statistic of 14 is out of sight, but remember, it’s a big sample.) The SE
comes from the observed information, [−L''(α̂, β̂)]^{−1}.

For each type of characteristic, effects are relative to an omitted category. 
(If you put in all the categories all the time, the design matrix will not have 
full rank and parameters will not be identifiable.) For example, there is a 
dummy variable for attending Catholic schools, but no dummy variable for 
public schools. Attending public school is the omitted category. The effect 
of attending Catholic schools is measured relative to public schools. 

Family income is represented in the model, but not as a continuous 
variable. Instead, there is a set of dummies to describe family income— 
missing, below $7000, $7000-$12,000, .... (Respondents ticked a box on 
the questionnaire to indicate a range for family income; some didn’t answer 
the question.) For each student, one and only one of the income dummies 
kicks in and takes the value 1; the others are all 0. The omitted category in 
table 3 is $38,000+. You have to look back at table 1 in the paper to spot the 
omitted category. 

A student whose family income is missing has a smaller chance of
graduating than a student whose family income is $38,000+, other things being
equal. The difference is −0.111 on the probit scale: you see −0.111 in the
“probit coefficient” column for the dummy variable “family income missing”
(Evans and Schwab, table 3). The negative sign should not be a surprise.
Generally, missing data is bad news. Similarly, a student whose family income is
below $7000 has a smaller chance of graduating than a student whose family
income is $38,000+, other things being equal. The difference is −0.300 on
the probit scale. The remaining coefficients in table 3 can be interpreted in a
similar way.

“Marginal effects” are reported in table 3 of the paper. For instance, the
marginal effect of Catholic schools is obtained by taking the partial derivative
of Φ(C_iα̂ + X_iβ̂) with respect to C_i:

\[ \frac{\partial}{\partial C_i} \Phi(C_i\hat\alpha + X_i\hat\beta) = \phi(C_i\hat\alpha + X_i\hat\beta)\,\hat\alpha, \tag{2a} \]

where φ = Φ′ is the standard normal density. The marginal effect of the jth
component of X_i is the partial derivative with respect to X_ij:

\[ \phi(C_i\hat\alpha + X_i\hat\beta)\,\hat\beta_j. \tag{2b} \]

But φ(C_iα̂ + X_iβ̂) depends on C_i, X_i. So, which values do we use? See
note 10 in the paper. We’re talking about a 17-year-old white female, living
with both parents, attending a public school. ...

Marginal effects are interpretable if you believe the model, and the vari- 
ables are continuous. Even if you take the model at face value, however, there 
is a big problem for categorical variables. Are you making female students a 
little more female? Are you making public schools a tiny bit Catholic?? 

The average treatment effect (at the end of table 3) is

\[ \frac{1}{n} \sum_{i=1}^{n} \Bigl[ \Phi(\hat\alpha + X_i\hat\beta) - \Phi(X_i\hat\beta) \Bigr]. \tag{3} \]

The formula compares students to themselves in two scenarios: (i) attends
Catholic school, (ii) attends public school. You take the difference in
graduation probabilities for each student. Then you average over the students in
the study: students are indexed by i = 1,..., n.

For each student, one scenario is factual; the other is counter-factual. 
After all, the student can’t go to both Catholic and public high schools— 
at least, not for long. Graduation is observed in the factual scenario only. 
The calculation does not use observable outcomes. Instead, the calculation 
uses probabilities computed from the model. This is OK if the model can be 
trusted. Otherwise, the numbers computed from (3) don’t mean very much. 
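
The arithmetic in (3) is simple once the fitted values X_iβ̂ are in hand. The sketch below uses α̂ = 0.777 from the text; the six values of X_iβ̂ are hypothetical (the paper's data are not reproduced here), and the function name is ours.

```python
from statistics import NormalDist

PHI = NormalDist().cdf
ALPHA_HAT = 0.777  # estimate quoted in the text (Evans and Schwab, table 3)

def average_treatment_effect(fitted_indices, alpha=ALPHA_HAT):
    """Formula (3): average over students of Phi(alpha + X_i beta) - Phi(X_i beta)."""
    diffs = [PHI(alpha + xb) - PHI(xb) for xb in fitted_indices]
    return sum(diffs) / len(diffs)

# Hypothetical values of X_i * beta_hat for six students.
hypothetical_indices = [-0.5, 0.0, 0.4, 0.8, 1.2, 1.6]
ate = average_treatment_effect(hypothetical_indices)
print(round(ate, 3))
```

Every term in the average is a difference of model probabilities, one of them counter-factual; no observed graduation outcome enters the calculation.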


Latent variables 


Equation (1) is equivalent to the following. Student i will graduate if

\[ C_i\alpha + X_i\beta + V_i > 0; \tag{4} \]

otherwise, i does not graduate. Remember, Y_i = 1 in the first case, and 0
in the second. Often, the recipe gets shortened rather drastically: Y_i = 1
if C_iα + X_iβ + V_i > 0, else Y_i = 0. Given the C’s and X’s, the latent
(unobservable) variables V_i are assumed to be IID N(0, 1) across subjects.
Latent variables are supposed to capture the effects of unmeasured variables
like intelligence, aptitude, motivation, parental attitudes. Evans and Schwab
derive equation (1) above from (4), but the “net benefit” talk justifying their
version of (4) is, well, just talk.


Response schedules 


Evans and Schwab treat Catholic school attendance along with sex, 
race, ... as manipulable. This makes little sense. Catholic school atten- 
dance might be manipulable, but many other measured variables are personal 
characteristics that would be hard to change. 

Apart from the measured covariates X_i, student i has the latent variable
V_i introduced above. The response schedule behind (4) is this. Student i
graduates if

\[ c\alpha + X_i\beta + V_i > 0; \tag{5} \]

otherwise, no graduation. Here, c can be set to 0 (send the kid to public school)
or 1 (send to Catholic school). Manipulating c doesn’t affect α, β, X_i, V_i—
which is quite an assumption.

There are also statistical assumptions: 


(6) the V_i are IID N(0, 1) across students i,
(7) the V’s are independent of the C’s and X’s. 


If (7) holds, then Nature is randomizing students to different combinations 
of C and X, independently of their V’s—another strong assumption. 

There is another way to write the response schedule. Given the covariate
matrix X, the conditional probability that i graduates is Φ(cα + X_iβ). This
function of c says what the graduation probability would be if we intervened
and set c to 0. The probability would be Φ(X_iβ). The function also says
what the graduation probability would be if we intervened and set c to 1. The
probability would be Φ(α + X_iβ). The normal distribution function Φ is
relevant because—by assumption—the latent variable V_i is N(0, 1).

The response schedule is theory. Nobody intervened to set c. High 
School and Beyond was a sample survey, not an experiment. Nature took its 
course, and the survey recorded what happened. Thus, C; is the value for c 
chosen by Nature for student i. 

The response schedule may be just theory, but it’s important. The theory
is what bridges the gap between association and causation. Without (5), it 
would be hard to draw causal conclusions from observational data. With- 
out (6) and (7), the statistical procedures would be questionable. Parameter 
estimates and standard errors might be severely biased. 

Evans and Schwab are concerned that C may be endogenous, that is, re- 
lated to V. Endogeneity would bias the study. For instance, Catholic schools 
might look good because they select good students. Evans and Schwab offer 
a two-equation model—our next topic—to take care of this problem. 


The second equation 


The two-equation model is shown in figure 1. The first equation—in its
response-schedule form—says that student i graduates if

\[ c\alpha + X_i\beta + V_i > 0; \tag{8} \]

otherwise, no graduation. This is just (5), repeated for convenience.

We could in principle set c to 1, i.e., put the kid in Catholic school. Or,
we could set c to 0, i.e., put him in public school. In fact, Nature chooses
c. Nature does it as if by using the second equation in the model. That’s the
novelty.

To state the second equation, let IsCat_i = 1 if student i is Catholic, else
IsCat_i = 0. Then student i attends Catholic school (C_i = 1) if

\[ \mathrm{IsCat}_i\,a + X_i b + U_i > 0; \tag{9} \]

otherwise, public school (C_i = 0). Equation (9) is the second equation in the
model: a is a new parameter, and b is a new parameter vector.

Nature proceeds as if by generating C_i from (9), and substituting this C_i
for c in (8) to decide whether student i graduates. That is what ties the two
equations together. The latent variables U_i and V_i in the two equations might
be correlated, as indicated by the dashed curve in figure 1. The correlation is
another new parameter, denoted ρ.

The statistical assumptions in the two-equation model are as follows. 


(10) (U_i, V_i) are IID, as pairs, across students i.


(11) (U_i, V_i) are bivariate normal; U_i has mean 0 and variance 1; so
does V_i: the correlation between U_i and V_i is ρ.


(12) The U’s and V’s are independent of the IsCat’s and X’s.


Condition (12) makes IsCat and X exogenous (sections 6.4-5). The correlation
ρ in (11) is a key parameter. If ρ = 0, then C_i is independent of V_i and
we don’t need the second equation after all. If ρ ≠ 0, then C_i is dependent
on V_i, because V_i is correlated with U_i, and U_i comes into the formula (9)
that determines C_i. So, assumption (7) in the single-equation model breaks
down. The two-equation model (also called the “bivariate probit”) is supposed
to take care of the breakdown. That is the whole point of the second equation.


Figure 1. The two-equation model.

U, V     Correlated N(0,1) latent variables
IsCat    Is Catholic
C        Goes to Catholic high school
Y        Graduates from high school
X        Control variables: gender, race, ethnicity. ...

[Diagram not reproduced: arrows run IsCat → C → Y; a dashed curve joins U and V.]


This isn’t a simple model, so let’s guide Nature through the steps she
has to take in order to generate the data. (Remember, we don’t have access
to the parameters α, β, a, b, or ρ—but Nature does.)


1. Choose IsCat_i and X_i.

2. Choose (U_i, V_i) from a bivariate normal distribution, with mean 0,
variance 1, and correlation ρ. The (U_i, V_i) are independent of the
IsCat’s and X’s. They are independent across students.

3. Check to see if inequality (9) holds. If so, set C_i to 1 and send student
i to Catholic school. Else set C_i to 0 and send i to public school.

4. Set c in (8) to C_i.

5. Check to see if inequality (8) holds. If so, set Y_i to 1 and make student
i graduate. Else set Y_i to 0 and prevent i from graduating.

6. Reveal IsCat_i, X_i, C_i, Y_i.

7. Shred U_i and V_i. (Hey, they’re latent.)
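
Nature's seven steps translate almost line for line into a simulation. The sketch below is only an illustration: apart from α = 0.777, which echoes the estimate quoted earlier, all parameter values are hypothetical, X_i is reduced to a constant plus a single dummy, and IsCat_i is drawn independently of X_i for simplicity. The function names are ours.

```python
import math
import random

def dot(row, coef):
    """Inner product of a covariate row with a coefficient vector."""
    return sum(r * c for r, c in zip(row, coef))

def nature_generates(is_cat, x, alpha, beta, a, b, rho, rng):
    """Steps 2-7 for one student; step 1 (IsCat_i and X_i) is done by the caller."""
    # Step 2: a correlated N(0, 1) pair built from two independent N(0, 1) draws.
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    u = z1
    v = rho * z1 + math.sqrt(1.0 - rho * rho) * z2
    # Step 3: inequality (9) decides the school.
    c = 1 if is_cat * a + dot(x, b) + u > 0 else 0
    # Steps 4-5: substitute C_i for c in (8) to decide graduation.
    y = 1 if c * alpha + dot(x, beta) + v > 0 else 0
    # Steps 6-7: reveal the observables; u and v are discarded (latent).
    return is_cat, x, c, y

rng = random.Random(2024)
# Hypothetical parameter values, chosen only for illustration.
alpha, beta = 0.777, (0.8, 0.2)
a, b, rho = 1.0, (-1.5, 0.1), 0.3

students = []
for _ in range(20_000):
    is_cat = 1 if rng.random() < 0.3 else 0          # step 1: IsCat_i
    x = (1.0, 1.0 if rng.random() < 0.5 else 0.0)    # step 1: X_i = (constant, one dummy)
    students.append(nature_generates(is_cat, x, alpha, beta, a, b, rho, rng))

def catholic_school_rate(group):
    """Share attending Catholic school among students with IsCat equal to group."""
    rows = [s for s in students if s[0] == group]
    return sum(s[2] for s in rows) / len(rows)

print(catholic_school_rate(1), catholic_school_rate(0))
```

Because a > 0 in this setup, Catholic students should turn up in Catholic schools more often than other students, which is what lets IsCat act on C.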


Evans and Schwab want to have at least one exogenous variable that 
influences C but has no direct influence on Y. That variable is called an 
“instrument” or “instrumental variable.” Here, IsCat is the instrument: it is 
1 if the student is Catholic, else 0. IsCat comes into the model (9) for choos- 
ing schools, but is excluded, by assumption, from the graduation model (8). 
Economists call this sort of assumption an “exclusion restriction” or an
“identifying restriction” or a “structural zero.” In figure 1, there is no arrow from
IsCat to Y. That is the graphical tipoff to an exclusion restriction. 

The exogeneity of IsCat and X is a key assumption. In figure 1, there are 
no arrows or dotted lines connecting IsCat and X to U and V. That is how the 
graph represents exogeneity. Without exogeneity assumptions and exclusion 
restrictions, parameters are seldom identifiable; there are more examples in 
chapter 9. (The figure may be misleading in one respect: IsCat is correlated 
with X, although perfect collinearity is excluded.) 

The two-equation model—equations (8) and (9), with assumptions
(10)-(11)-(12) on the latent variables—is estimated by maximum likelihood.
Results are shown in line (2), table 6 of the paper. They are similar—at least for
school effects—to the single-equation model (table 3). This is because the
estimated value for ρ is negligible.

Exogeneity. This term has several different meanings. Here, we use 
it in a fairly weak sense: exogenous variables are independent of the latent 
variables. By contrast, endogenous variables are dependent on the latent 
variables. Technically, exogeneity has to be defined relative to a model, which 
makes the concept even more confusing. For example, take the two-equation 
model (8)-(9). In this model, C is endogenous, because it is influenced by 
the latent U. In (4), however, C could be exogenous: if ρ = 0, then C ⫫ V.
We return to endogeneity in chapter 9. 


Mechanics: bivariate probit 


In this section, we’ll see how to write down the likelihood function
for the bivariate probit model. Condition on all the exogenous variables,
including IsCat. The likelihood function is a product, with one factor for
each student. That comes from the independence assumptions, (10) and (12).
Take student i. There are 2 × 2 = 4 cases to consider: C_i = 0 or 1, and
Y_i = 0 or 1.

Let’s start with C_i = 1, Y_i = 1. These are facts about student i recorded
in the High School and Beyond survey, as are the values for IsCat_i and X_i;
what you won’t find on the questionnaire is U_i or V_i. We need to compute the
chance that C_i = 1 and Y_i = 1, given the exogenous variables. According to
the model—see (8) and (9)—C_i = 1 and Y_i = 1 if

\[ U_i > -\mathrm{IsCat}_i\,a - X_i b \quad\text{and}\quad V_i > -\alpha - X_i\beta. \]

So the chance that C_i = 1 and Y_i = 1 is

\[ P\{U_i > -\mathrm{IsCat}_i\,a - X_i b \text{ and } V_i > -\alpha - X_i\beta\}. \tag{13} \]

The kid contributes the factor (13) to the likelihood. Notice that α appears
in (13), because C_i = 1.

Let’s do one more case: C_i = 0 and Y_i = 1. The model says that C_i = 0
and Y_i = 1 if

\[ U_i < -\mathrm{IsCat}_i\,a - X_i b \quad\text{and}\quad V_i > -X_i\beta. \]

So the chance is

\[ P\{U_i < -\mathrm{IsCat}_i\,a - X_i b \text{ and } V_i > -X_i\beta\}. \tag{14} \]

This kid contributes the factor (14) to the likelihood. Notice that α does not
appear in (14), because C_i = 0. The random elements in (13)-(14) are the
latent variables U_i and V_i, while IsCat_i and X_i are treated as data: remember,
we conditioned on the exogenous variables.

Now we have to evaluate (13) and (14). Don’t be hasty. Multiplying
chances in (13), for instance, would not be a good idea—because of the
correlation between U_i and V_i:

\[
P\{U_i > -\mathrm{IsCat}_i\,a - X_i b \text{ and } V_i > -\alpha - X_i\beta\} \ne
P\{U_i > -\mathrm{IsCat}_i\,a - X_i b\} \cdot P\{V_i > -\alpha - X_i\beta\}.
\]

The probabilities can be worked out from the bivariate normal density—
assumption (11). The formula will involve ρ, the correlation between U_i and
V_i. The bivariate normal density for (U_i, V_i) is

\[ \phi(u, v) = \frac{1}{2\pi\sqrt{1 - \rho^2}} \exp\left( -\frac{u^2 - 2\rho u v + v^2}{2(1 - \rho^2)} \right). \tag{15} \]

(This is a special case of the formula in theorem 3.2: the means are 0 and the
variances are 1.) So the probability in (13), for example, is

\[ \int_{-\alpha - X_i\beta}^{\infty} \int_{-\mathrm{IsCat}_i a - X_i b}^{\infty} \phi(u, v)\,du\,dv. \]


The integral cannot be done in closed form by calculus. Instead, we 
would have to use numerical methods (“quadrature”) on the computer. See the 
chapter end notes for hints and references. After working out the likelihood, 
we would have to maximize it—which means working it out a large number 
of times. All in all, the bivariate probit is a big mess to code from scratch. 
There is software that tries to do the whole thing for you, e.g., biprobit in 
STATA, proc qlim in SAS, or vglm in the VGAM library for R. However,
finding maxima in high-dimensional spaces is something of a black art; and 
the higher the dimensionality, the blacker the art. 
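
As a rough illustration of the quadrature step, here is a minimal Python sketch. It is not the method used by Evans and Schwab or by any of the packages named above. It reduces the double integral to a single integral by conditioning on U = u (given U = u, V is normal with mean rho*u and variance 1 - rho^2) and then applies Simpson's rule. The function names, the cutoff at 8, and the parameter values in the usage lines are assumptions made for the example.

```python
import math

def phi(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def upper_orthant(h, k, rho, n=2000, cutoff=8.0):
    """P{U > h and V > k} for a standard bivariate normal pair with
    correlation rho (|rho| < 1), by Simpson's rule on [h, cutoff]."""
    lo = max(h, -cutoff)
    if lo >= cutoff:
        return 0.0
    s = math.sqrt(1.0 - rho * rho)
    # Given U = u, V is N(rho*u, 1 - rho^2), so
    # P{V > k | U = u} = Phi((rho*u - k) / s).
    f = lambda u: phi(u) * Phi((rho * u - k) / s)
    step = (cutoff - lo) / n          # n must be even
    total = f(lo) + f(cutoff)
    for j in range(1, n):
        total += (4 if j % 2 else 2) * f(lo + j * step)
    return total * step / 3.0

def factor_13(iscat, xb_sel, xb_grad, a, alpha, rho):
    """Likelihood factor (13): C_i = 1 and Y_i = 1.
    xb_sel stands for X_i b, xb_grad for X_i beta."""
    return upper_orthant(-(iscat * a + xb_sel), -(alpha + xb_grad), rho)

def factor_14(iscat, xb_sel, xb_grad, a, rho):
    """Likelihood factor (14): C_i = 0 and Y_i = 1."""
    c = -(iscat * a + xb_sel)
    k = -xb_grad
    # P{U < c, V > k} = P{V > k} - P{U > c, V > k}
    return (1.0 - Phi(k)) - upper_orthant(c, k, rho)

# Hypothetical values: X_i b = -1.0, X_i beta = 0.8, a = 0.9, alpha = 0.5, rho = 0.3
print(factor_13(1, -1.0, 0.8, 0.9, 0.5, 0.3))
print(factor_14(1, -1.0, 0.8, 0.9, 0.3))
```

With rho = 0 the factor reduces to the product of the two marginal chances; with rho different from 0 it does not, which is the reason for the warning about multiplying chances.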


Why a model rather than a cross-tab? 


Tables 1 and 3 of Evans and Schwab have 2 sexes, 3 racial groups 
(white, black, other), 2 ethnicities (Hispanic or not), 8 income categories, 
5 educational levels, 5 types of family structure, 4 age groups, 3 levels of 
attending religious service. The notes to table 3 suggest 3 place types (urban, 
suburban, rural) and 4 regions (northeast, midwest, south, west). That makes 


$$2 \times 3 \times 2 \times 8 \times 5 \times 5 \times 4 \times 3 \times 3 \times 4 = 345{,}600$$


types of students. Each student might or might not be Catholic, and might or
might not attend Catholic school, which gives another factor of $2 \times 2 = 4$.
Even with a huge sample, a cross-tab can be very, very sparse. A probit 
model like equation (1) enables you to handle a sparse table. This is good. 
However, the model assumes—without warrant—that probabilities are linear 
and additive (on the probit scale) in the selected variables. Bad. 
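
To see how sparse the table is, a quick calculation helps. The sketch below multiplies out the category counts listed above and divides the sample size by the number of cells. The figure of 13,294 students is the total from table 1 of Evans and Schwab (10,767 + 2527); the dictionary keys are labels chosen for this example.

```python
import math

# Number of categories for each variable, as listed in the text.
levels = {
    "sex": 2, "race": 3, "ethnicity": 2, "income": 8, "education": 5,
    "family_structure": 5, "age": 4, "religious_attendance": 3,
    "place_type": 3, "region": 4,
}

student_types = math.prod(levels.values())
cells = student_types * 2 * 2        # Catholic or not, Catholic school or not
n_students = 10767 + 2527

print(student_types)                 # 345600
print(cells)                         # 1382400
print(n_students / cells)            # average number of students per cell
```

On average there is about one student for every hundred cells, so nearly all the cells must be empty.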

Let’s look more closely at linearity and additivity. The model assumes 
that income has the same effect at all levels of education. Effects are the 
same for all types of families, wherever they live. And so forth. Especially, 
Catholic schools have the same additive effect (on the probit scale) for all 
types of students. 

Effects are assumed to be constant inside each of the bins that define a 
dummy variable. For instance, “some college” is a bin for parent education 
(Evans and Schwab, table 3). According to the model, one year of college 
for the parents has the same effect on graduation rates as would two years of 
college. Similar comments apply to the other bins. 


Interactions 


To weaken the assumptions of linearity and additivity, people some- 
times put interactions into the model. Interactions are usually represented 
as products. With dummy variables, that’s pretty simple. For instance, the 
interaction of a dummy variable for male and a dummy for white gives you 
a dummy for male whites. A “three-way interaction” between male, white, 
and Hispanic gives you a dummy for male white Hispanics. And so forth. 

If x, z, and the interaction term xz go into the model as explanatory 
variables, and you intervene to change x, you need to think about how the 
interaction term will change when x changes. This will depend on the value 
of z. The whole point of putting the interaction term into the equation was to 
get away from linearity and additivity. 
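
A small numerical sketch may make the point concrete. The coefficients below are made up for illustration and are not taken from any fitted model. Once the product term is in the equation, the change in the linear predictor when x goes from 0 to 1 depends on z.

```python
def linear_predictor(x, z, b0=0.2, bx=0.5, bz=-0.3, bxz=0.4):
    """Linear predictor with an interaction term.
    All coefficients are hypothetical."""
    return b0 + bx * x + bz * z + bxz * x * z

for z in (0, 1):
    effect = linear_predictor(1, z) - linear_predictor(0, z)
    print(f"z = {z}: changing x from 0 to 1 moves the predictor by {effect:.1f}")
```

With these made-up numbers the shift is bx when z = 0 and bx + bxz when z = 1, so there is no single "effect of x" to report.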
If you put in all the interactions, you’re back in the cross-tab, and don’t
have nearly enough data. With finer categories, there could also be a shortage 
of data. In effect, the model substitutes assumptions (e.g., no interactions) 
for data. If the assumptions are good, we’re making progress. Otherwise, we 
may only be assuming that progress has been made. Evans and Schwab test 
their model in several ways, but with 13,000 observations and a few hundred 
thousand possible interactions, power is limited. 


More on table 3 in Evans and Schwab 


A lot of the coefficient estimates make sense. For instance, the probabil- 
ity of a successful outcome goes up with parental education. The probability 
of success is higher if the family is intact. And so forth. Some of the re- 
sults are puzzling. Were blacks and Hispanics more likely to graduate in the 
1980s, after controlling for the variables in table 3 of the paper? Compare, 
e.g., Jencks and Phillips (1998). It is also hard to see why there is no income 
effect on graduation beyond $20,000 a year, although there is an effect on 
attending college. (The results in table 2 weaken this objection; the problems 
with income may be in the data.) It is unclear why the test scores discussed in 
table 2 are excluded from the model. Indeed, many of the variables discussed 
in Coleman et al (1982) are ignored by Evans and Schwab, for reasons that 
are not explained. 

Coleman et al (1982, pp. 8, 103-15, 171-78) suggest that a substantial 
part of the difference in outcomes for students at Catholic and public schools is 
due to differences in the behavior of student peer groups. If so, independence 
of outcomes is in question. So is the basic causal model, because changing 
the composition of the student body may well change the effectiveness of the 
school. Then responses depend on the treatment of groups not the treatment 
of individuals, contradicting the model. (Section 6.5 discusses this point for 
regression.) Evans and Schwab have a partial response to problems created 
by omitted variables and peer groups: see table 4 in their paper. 


More on the second equation 


What the second equation is supposed to do is to take care of a possible 
correlation between attending Catholic school and the latent variable V in (8). 
The latent variable represents unmeasured characteristics like intelligence, 
aptitude, motivation, parental attitudes. Such characteristics are liable to be 
correlated with some of the covariates, which are then endogenous. Student 
age is a covariate, and a high school student who is 19+ is probably not 
the most intelligent and best motivated of people. Student age is likely to 
be endogenous. So is place of residence, because many parents will decide
where to live based on the educational needs of their children. These kinds of 
endogeneity, which would also bias the MLE, are not addressed in the paper. 

There was a substantial non-response rate for the survey: 30% of the 
sample schools refused to participate in the study. If, e.g., low-achieving 
Catholic schools are less likely to respond than other schools, the effect of 
Catholic schools on outcomes will be overstated. If low-achieving public 
schools are the missing ones, the effect of Catholic schools will be understated. 

Within participating schools, about 15% of the students declined to re- 
spond in 1980. There were also dropouts—students in the 1980 survey but 
not the 1982/1984 followup. The dropout rate was in the range 10%-20%.
In total, half the data are missing. If participation in the study is endogenous, 
the MLE is biased. The paper does not address this problem. 
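
The "half" can be checked with rough arithmetic. The sketch below assumes, for simplicity, that the three kinds of loss act one after another and multiply; the text does not say this, and the function name is ours.

```python
def retained_fraction(dropout, school_participation=0.70, student_response=0.85):
    """Fraction of the intended sample left after school refusals (30%),
    student non-response (about 15%), and dropout before the followup.
    Assumes the three losses multiply, which is a simplification."""
    return school_participation * student_response * (1.0 - dropout)

for dropout in (0.10, 0.20):
    print(f"dropout {dropout:.0%}: {retained_fraction(dropout):.0%} retained")
```

Under this assumption, roughly 48% to 54% of the intended sample remains, in line with the statement that about half the data are missing.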

There is a troublesome exclusion restriction: IsCat is not used as an 
explanatory variable in the graduation model. Evans and Schwab present 
alternative specifications to address some of the modeling issues. In the end, 
however, there remain a lot of question marks. 


Exercise set E 


1. In table 3 of Evans and Schwab, is 0.777 a parameter or an estimate? How
is this number related to equation (1)? Is this number on the probability
scale or the probit scale? Repeat for 0.041, in the FEMALE line of the
table. (The paper is reprinted at the back of the book.)

2. What does the -0.204 for PARENT SOME COLLEGE in table 3 mean?

3. Here is the two-equation model in brief: student $i$ goes to Catholic school
($C_i = 1$) if

$$\mathrm{IsCat}_i a + X_i b + U_i > 0,$$

and graduates if

$$C_i\alpha + X_i\beta + V_i > 0.$$

(a) Which parameter tells you the effect of Catholic schools?
(b) The $U_i$ and $V_i$ are ______ variables. Options (more than one
may be right):
data random latent dummy observable

(c) What are the assumptions on $U_i$ and $V_i$?

4. In line (2) of table 6 in Evans and Schwab, is 0.859 a parameter or an
estimate? How is it related to the equations in exercise 3? What about
the -0.053? What does the -0.053 tell you about selection effects in
the one-equation model?
5. In the two-equation model, the log likelihood function is a ______,
with one ______ for each ______. Fill in the blanks using one of
the options below, and explain briefly.
sum product quotient matrix term
factor entry student school variable

6. Student #77 is Presbyterian, went to public school, and graduated. What
does this subject contribute to the likelihood function? Write your answer
using $\phi$ in equation (15).

7. Student #4039 is Catholic, went to public school, and failed to graduate.
What does this subject contribute to the likelihood function? Write your
answer using $\phi$ in equation (15).

8. Does the correlation between the latent variables in the two equations
turn up in your answers to exercises 6 and 7? If so, where?

9. Table 1 in Evans and Schwab shows the total sample as 10,767 in the
Catholic schools and 2527 in the public schools. Is this reasonable?
Discuss briefly.

10. Table 1 shows that 0.97 of the students at Catholic schools graduated.
Underneath the 0.97 is the number 0.17. What is this number, and how
is it computed? Comment briefly.

11. For bonus points: suppose the two-equation model is right, and you had
a really big sample. Would you get accurate estimates for $\alpha$? $\rho$? the $V_i$?

7.5 Discussion questions 


Some of these questions cover material from previous chapters. 


1. Is the MLE biased or unbiased?

2. In the usual probit model, are the response variables independent from
one subject to another? Or conditionally independent given the explanatory
variables? Do the explanatory variables have to be statistically
independent? Do they have to be linearly independent? Explain briefly.

3. Here is the two-equation model of Evans and Schwab, in brief. Student $i$
goes to Catholic school ($C_i = 1$) if

$$\mathrm{IsCat}_i a + X_i b + U_i > 0, \qquad \text{(selection)}$$

otherwise $C_i = 0$. Student $i$ graduates ($Y_i = 1$) if

$$C_i\alpha + X_i\beta + V_i > 0, \qquad \text{(graduation)}$$

otherwise $Y_i = 0$. $\mathrm{IsCat}_i$ is 1 if $i$ is Catholic, and 0 otherwise; $X_i$
is a vector of dummy variables describing subject $i$’s characteristics,
including gender, race, ethnicity, family income, and so forth. Evans
and Schwab estimate the parameters by maximum likelihood, finding
that $\hat\alpha$ is large and highly significant. True or false and explain:

(a) The statistical model makes a number of assumptions about the
latent variables.

(b) However, the parameter estimates and standard errors are computed
from the data.

(c) The computation in (b) can be done whether or not the assumptions
about the latent variables hold true. Indeed, the computation uses
$\mathrm{IsCat}_i$, $X_i$, $C_i$, $Y_i$ for $i = 1, \ldots, n$ and the bivariate normal density
but does not use the latent variables themselves.

(d) Therefore, the statistical calculations in Evans and Schwab are fine,
even if the assumptions about the latent variables are not true.


4. To what extent do you agree or disagree with the following statements
about the paper by Evans and Schwab?

(a) The paper demonstrates causation using the data: Catholic schools
have an effect on student graduation rates, other things being equal.

(b) The paper assumes causation: Catholic schools have an effect on
student graduation rates, other things being equal. The paper assumes
a specific functional form to implement the idea of causation
and other things being equal, namely the probit model. The paper uses
the data to estimate the size of the Catholic school effect.

(c) The graduation equation tests for interactions among explanatory
variables in the selection equation.

(d) The graduation equation assumes there are no interactions.

(e) The computer derives the bivariate probit model from the data.

(f) The computer is told to assume the bivariate probit model. What
the computer derives from the data is estimates for parameters in
the model.


5. Suppose high school students work together in small groups to study 
the material in the courses. Some groups have a strong positive effect, 
helping the students get on top of the course work. Some groups have 
a negative effect. And some groups have no effect. Are study groups 
consistent with the model used by Evans and Schwab? If not, which 
assumptions are contradicted? 


6. Powers and Rock (1999) consider a two-equation model for the effect
of coaching on SAT scores:

$$X_i = 1 \ \text{if}\ U_i\alpha + \delta_i > 0, \ \text{else}\ X_i = 0; \qquad \text{(assignment)}$$
$$Y_i = cX_i + V_i\beta + \sigma\epsilon_i. \qquad \text{(response)}$$

Here, $X_i = 1$ if subject $i$ is coached, else $X_i = 0$. The response
variable $Y_i$ is subject $i$’s SAT score; $U_i$ and $V_i$ are vectors of personal
characteristics for subject $i$, treated as data. The latent variables $(\delta_i, \epsilon_i)$
are IID bivariate normal with mean 0, variance 1, and correlation $\rho$;
they are independent of the $U$’s and $V$’s. (In this problem, $U$ and $V$ are
observable, $\delta$ and $\epsilon$ are latent.)

(a) Which parameter measures the effect of coaching? How would you
estimate it?

(b) State the assumptions carefully (including a response schedule, if
one is needed). Do you find the assumptions plausible?

(c) Why do Powers and Rock need two equations, and why do they
need $\rho$?

(d) Why can they assume that the disturbance terms have variance 1?

Hint: look at sections 7.2 and 7.4. 


7. Shaw (1999) uses a regression model to study the effect of TV ads and 
candidate appearances on votes in the presidential elections of 1988, 
1992, and 1996. With three elections and 51 states (DC counts for this 
purpose), there are 153 data points, i.e., pairs of years and states. Each 
variable in the model is determined at all 153 points. In a given year 
and state, the volume TV of television ads is measured in 100s of GRPs 
(gross rating points). Rep.TV, for example, is the volume of TV ads
placed by the Republicans. AP is the number of campaign appearances
by a presidential candidate. UN is the percent undecided according to
tracking polls. PE is Perot’s support, also from tracking polls. (Ross
Perot was a maverick candidate.) RS is the historical average Republican
share of the vote. There is a dummy variable $D_{1992}$, which is 1 in
1992 and 0 in the other years. There is another dummy $D_{1996}$ for 1996.
A regression equation is fitted by OLS, and the Republican share of the
vote is

$$\begin{aligned}
& -0.326 - 2.324 \times D_{1992} - 5.001 \times D_{1996} \\
& + 0.430 \times (\text{Rep.TV} - \text{Dem.TV}) + 0.766 \times (\text{Rep.AP} - \text{Dem.AP}) \\
& + 0.066 \times (\text{Rep.TV} - \text{Dem.TV}) \times (\text{Rep.AP} - \text{Dem.AP}) \\
& + 0.032 \times (\text{Rep.TV} - \text{Dem.TV}) \times \text{UN} + 0.089 \times (\text{Rep.AP} - \text{Dem.AP}) \times \text{UN} \\
& + 0.006 \times (\text{Rep.TV} - \text{Dem.TV}) \times \text{RS} + 0.017 \times (\text{Rep.AP} - \text{Dem.AP}) \times \text{RS} \\
& + 0.009 \times \text{UN} + 0.002 \times \text{PE} + 0.014 \times \text{RS} + \text{error}.
\end{aligned}$$

(a) What are dummy variables, and why might $D_{1992}$ be included in
the equation?

(b) According to the model, if the Republicans buy another 500 GRPs
in a state, other things being equal, will that increase their share
of the vote in that state by $0.430 \times 5 \approx 2.2$ percentage points?
Answer yes or no, and discuss briefly. (The 0.430 is the coefficient
of Rep.TV - Dem.TV in the second line of the equation.)


8. The Nurses’ Health Study wanted to show that hormone replacement
therapy (HRT) reduces the risk of heart attack for post-menopausal 
women. The investigators found out whether each woman experienced a 
heart attack during the study period, and her HRT usage: 6,224 subjects 
were on HRT and 27,034 were not. For each subject, baseline mea- 
surements were made on potential confounders: age, height, weight, 
cigarette smoking (yes or no), hypertension (yes or no), and high choles- 
terol level (yes or no). 


(a) If the investigators asked you whether to use OLS or logistic re- 
gression to explain the risk of heart attack in terms of HRT usage 
(yes/no) and the confounders, what would be your advice? Why? 


(b) State the model explicitly. What is the design matrix X? n? p? 
How will the yes/no variables be represented in the design matrix? 
What is Y? What is the response schedule? 

(c) Which parameter is the crucial one? 

(d) Would the investigators hope to see a positive estimate or a nega- 
tive estimate for the crucial parameter? How can they determine 
whether the estimate is statistically significant? 

(e) What are the key assumptions in the model? 

(f) Why is a model needed in the first place? a response schedule? 

(g) To what extent would you find the argument convincing? Discuss 
briefly. 

Comment. Details of the study have been changed a little for purposes
of this question; see chapter end notes.


9. People often use observational studies to demonstrate causation, but
there’s a big problem. What is an observational study, what’s the prob- 
lem, and how do people try to get around it? Discuss. If possible, give 
examples to illustrate your points. 


10. There is a population of $N$ subjects, indexed by $i = 1, \ldots, N$. Each
subject will be assigned to treatment T or control C. Subject $i$ has a
response $y_i^T$ if assigned to treatment and $y_i^C$ if assigned to control. Each
response is 0 (“failure”) or 1 (“success”). For instance, in an experiment
to see whether aspirin prevents death from heart attack, survival over the
followup period would be coded as 1, death would be coded as 0. If you
assign subject $i$ to treatment, you observe $y_i^T$ but not $y_i^C$. Conversely,
if you assign subject $i$ to control, you observe $y_i^C$ but not $y_i^T$. These
responses are fixed (not random).

Each subject $i$ has a $1 \times p$ vector of personal characteristics $w_i$, unaffected
by assignment. In the aspirin experiment, these characteristics might
include weight and blood pressure just before the experiment starts.
You can always observe $w_i$. Population parameters of interest are

$$\alpha^T = \frac{1}{N}\sum_{i=1}^{N} y_i^T, \qquad \alpha^C = \frac{1}{N}\sum_{i=1}^{N} y_i^C, \qquad \alpha^T - \alpha^C.$$

The first parameter is the fraction of successes we would see if all subjects
were put into treatment. We could measure this directly, by putting all
the subjects into treatment, but would then lose our chance to learn
about the second parameter, which is the fraction of successes if all
subjects were in the control condition. The third parameter is the difference
between the first two parameters. It measures the effectiveness
of treatment, on average across all the subjects. This parameter is the
most interesting of the three. It cannot be measured directly, because
we cannot put subjects both into treatment and into control.

Suppose $0 < n < N$. In a “randomized controlled experiment,” $n$ subjects
are chosen at random without replacement and assigned to treatment;
the remaining $N - n$ subjects are assigned to control. Can you
estimate the three population parameters of interest? Explain. Hint: see
discussion questions 7-8 in chapter 6.


11. (This continues question 10.) The assignment variable $X_i$ is defined as
follows: $X_i = 1$ if $i$ is in treatment, else $X_i = 0$. The probit model says
that given the assignments, subjects are independent, the probability
of success for subject $i$ being $\Phi(X_i\alpha + w_i\beta)$, where $\Phi$ is the standard
normal distribution function and $w_i$ is a vector of personal characteristics
for subject $i$.

(a) Would randomization justify the probit model?

(b) The logit model replaces $\Phi$ by $\Lambda(x) = e^x/(1 + e^x)$. Would
randomization justify the logit model?

(c) Can you analyze the data without probits, logits, ...?

Explain briefly. Hint: see discussion questions 9-10 in chapter 6.


12. Malaria is endemic in parts of Africa. A vaccine is developed to protect 
children against this disease. A randomized controlled experiment is 
done in a small rural village: half the children are chosen at random to 
get the vaccine, and half get a placebo. Some epidemiologists want to 
analyze the data using the setup described in question 10. What is your 
advice? 


13. As in question 12, but this time, the epidemiologists have 20 isolated
rural villages. They choose 10 villages at random for treatment. In these
villages, everybody will get the vaccine. The other 10 villages will serve 
as the control group: nobody gets the vaccine. Can the epidemiologists 
use the setup described in question 10? 


14. Suppose we accept the model in question 10, but data are collected on $X_i$
and $Y_i$ in an observational study, not a controlled experiment. Subjects
assign themselves to treatment ($X_i = 1$) or control ($X_i = 0$), and we
observe the response $Y_i$ as well as the covariates $w_i$. One person suggests
separating the subjects into several groups with similar $w_i$’s. For each
group on its own, we can compare the fraction of successes in treatment
to the fraction of successes in control. Another person suggests fitting a
probit model: conditional on the $X$’s and covariates, the probability that
$Y_i = 1$ is $\Phi(X_i\alpha + w_i\beta)$. What are the advantages and disadvantages
of the two suggestions?


15. Paula has observed values on four independent random variables with
common density $f_{\alpha,\beta}(x) = c(\alpha, \beta)(\alpha x - \beta)^2 \exp[-(\alpha x - \beta)^4]$, where
$\alpha > 0$, $-\infty < \beta < \infty$, and $c(\alpha, \beta)$ is chosen so that $\int_{-\infty}^{\infty} f_{\alpha,\beta}(x)\,dx = 1$.
She estimates $\alpha$, $\beta$ by maximum likelihood and computes the standard
errors from the observed information. Before doing the t-test to
see whether $\hat\beta$ is significantly different from 0, she decides to get some
advice. What do you say?


16. Jacobs and Carmichael (2002) are comparing various sociological the- 
ories that explain why some states have the death penalty and some do 
not. The investigators have data for 50 states (indexed by i) in years 
$t = 1971, 1981, 1991$. The response variable $Y_{it}$ is 1 if state $i$ has the
death penalty in year $t$, else 0. There is a vector of explanatory variables
$X_{it}$ and a parameter vector $\beta$, the latter being assumed constant across
states and years. Given the explanatory variables, the investigators assume
the response variables are independent and

$$\log[-\log P(Y_{it} = 0 \mid X)] = X_{it}\beta.$$


(This is a “complementary log log” or “cloglog” model.) After fitting the 
equation to the data by maximum likelihood, the investigators determine 
that some coefficients are statistically significant and some are not. The 
results favor certain theories over others. The investigators say, 


“All standard errors are corrected for heteroscedasticity by White’s 
method.... Estimators are robust to misspecification because the 
estimates are corrected for heteroscedasticity.” 


(The quote is slightly edited.) “Heteroscedasticity” means unequal variances
(section 5.4). White’s method is discussed in the end notes to
chapter 5: it estimates SEs for OLS when the $\epsilon$’s are heteroscedastic,
using equation (5.8). “Robust to misspecification” means that the method
works pretty well even if the model is wrong.


Discuss briefly, answering these questions. Are the authors claiming
that parameter estimates are robust, or estimated standard errors? If
the former, what do the estimates mean when the model is wrong? If
the latter, according to the model, is $\mathrm{var}(Y_{it} \mid X)$ different for different
combinations of $i$ and $t$? Are these differences taken into account by the
asymptotic SEs? Do asymptotic SEs for the MLE need correction for
heteroscedasticity?


17. Ludwig is working hard on a statistics project. He is overheard muttering
to himself, “Ach! Schrecklich! So many Parameters! So little Data!” 
Is he worried about bias, endogeneity, or non-identifiability? 


18. Garrett (1998) considers the impact of left-wing political power (LPP)
and trade-union power (TUP) on economic growth. There are 25 years
of data on 14 countries. Countries are indexed by $i = 1, \ldots, 14$; years
are indexed by $t = 1, \ldots, 25$. The growth rate for country $i$ in year $t$ is
modeled as

$$a \times \mathrm{LPP}_{it} + b \times \mathrm{TUP}_{it} + c \times \mathrm{LPP}_{it} \times \mathrm{TUP}_{it} + X_{it}\beta + \epsilon_{it},$$

where $X_{it}$ is a vector of control variables. Estimates for $a$ and $b$ are negative,
suggesting that right-wing countries grow faster. Garrett rejects
this idea, because the estimated coefficient $c$ of the interaction term is
positive. This term is interpreted as the “combined impact” of left-wing
political power and trade-union power, Garrett’s conclusion being that
the country needs both kinds of left-wing power in order to grow more
rapidly. Assuming the model is right, does $c \times \mathrm{LPP} \times \mathrm{TUP}$ measure the
combined impact of LPP and TUP? Answer yes or no, and explain.


19. This continues question 18; different notation is used: part (b) might be
a little tricky. Garrett’s model includes a dummy variable for each of
the 14 countries. The growth rate for country $i$ in year $t$ is modeled as

$$\alpha_i + Z_{it}\gamma + \epsilon_{it},$$

where $Z_{it}$ is a $1 \times 10$ vector of explanatory variables, including LPP, TUP,
and the interaction. (In question 18, the country dummies didn’t matter,
and were folded into $X$.) Beck (2001) uses the same model, except that
an intercept is included, and the dummy for country #1 is excluded. So,
in this second model, the growth rate in country $i > 1$ and year $t$ is

$$\alpha^* + \alpha_i^* + Z_{it}\gamma^* + \epsilon_{it},$$

whereas the growth rate in country #1 and year $t$ is

$$\alpha^* + Z_{1t}\gamma^* + \epsilon_{1t}.$$

Assume both investigators are fitting by OLS and using the same data.

(a) Why can’t you have a dummy variable for each of the 14 countries,
and an intercept too?

(b) Show that $\hat\gamma = \hat\gamma^*$, $\hat\alpha_1 = \hat\alpha^*$, and $\hat\alpha_i = \hat\alpha^* + \hat\alpha_i^*$ for $i > 1$.

Hints for (b). Let $M$ be the design matrix for the first model; $M^*$, for
the second. Find a lower triangular matrix $L$, which will have 1’s on
the diagonal and mainly be 0 elsewhere, such that $ML = M^*$. How
does this relationship carry over to the parameters and the estimates?


20. Yule used a regression model to conclude that outrelief causes pauperism
(section 1.4). He presented his paper at a meeting of the Royal Statistical 
Society on 21 March 1899. Sir Robert Giffen, Knight Commander of 
the Order of the Bath, was in the chair. There was a lively discussion, 
summarized in the Journal of the Royal Statistical Society (Vol. LXII, 
Part II, pp. 287-95). 


(a) According to Professor FY Edgeworth, if one diverged much from 
the law of normal errors, “one was on an ocean without rudder or 
compass”; this normal law of error “was perhaps more universal 
than the law of gravity.” Do you agree? Discuss briefly. 
(b) According to Sir Robert, practical men who were concerned with
poor-law administration knew that “if the strings were drawn tightly 
in the matter of out-door relief, they could immediately observe a 
reduction of pauperism itself.” Yule replied, 


“he was aware that the paper in general only bore out conclu- 
sions which had been reached before ... but he did not think 
that lessened the interest of getting an independent test of the 
theories of practical men, purely from statistics. It was an ab- 
solutely unbiassed test, and it was always an advantage in a 
method that it was unbiassed.” 


What do you think of this reply? Is Yule’s test “purely from statis- 
tics”? Is it Yule’s methods that are “unbiassed,” or his estimates of 
the parameters given his model? Discuss briefly. 


7.6 End notes for chapter 7 


Who reads books? Data are available from the August supplement to 
the Current Population Survey of 2002. Also see table 1213 in Statistical 
Abstract of the United States 2008. 


Specification. A “specification” says what variables go into a model, 
what the functional form is, and what should be assumed about the disturbance 
term (or latent variable); if the data are generated some other way, that is 
“specification error” or “misspecification.” 


The MLE. For a more detailed discussion of the MLE, with the outline 
of an argument for theorem 1, see 


http://www.stat.berkeley.edu/users/census/mle.pdf 


There are excellent graduate-level texts by Lehmann (1991ab) and Rao (1973),
with careful statements of theorems and proofs. Lehmann (2004) might be 
the place to start: fewer details, more explanations. For exponential families, 
the calculus is easier; see, e.g., Barndorff-Nielsen (1980). In particular, there 
is (with minor conditions) a unique max. 

The theory for logits is prettier than for probits, because the logit model
defines an exponential family. However, the following example shows that
even in a logit model, the likelihood may not have a maximum: theorems
have regularity conditions to eliminate this sort of exceptional case. Suppose
$X_i$ is real and $\mathrm{logit}\, P(Y_i = 1 \mid X_i = x) = \theta x$. We have two independent data
points. At the first, $X_1 = -1$, $Y_1 = 0$. At the second, $X_2 = 1$, $Y_2 = 1$. The
log likelihood function is $L(\theta) = -2\log(1 + e^{-\theta})$, which increases steadily
with $\theta$.
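
The behavior of this log likelihood, $L(\theta) = -2\log(1 + e^{-\theta})$, can be checked numerically. The short sketch below (the function name is ours) evaluates it on a grid: the values keep rising toward 0 without ever reaching it, so no finite $\theta$ is a maximizer.

```python
import math

def log_likelihood(theta):
    """L(theta) = -2 log(1 + exp(-theta)) for the two data points
    (X1 = -1, Y1 = 0) and (X2 = 1, Y2 = 1)."""
    return -2.0 * math.log1p(math.exp(-theta))

for theta in (0, 1, 5, 10, 20):
    print(theta, log_likelihood(theta))
```

A numerical optimizer applied to this likelihood would drift off toward large $\theta$ until it hit an iteration limit or a tolerance, which is one practical symptom of an endpoint maximum.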
Deviance. In brief, there is a model with $p$ parameters. The null hypothesis
constrains $p_0$ of these parameters to be 0. Maximize the log likelihood
over the full model. Denote the maximum by $M$. Then maximize the log
likelihood subject to the constraint, getting a smaller maximum $M_0$. The
deviance is $2(M - M_0)$. If the null hypothesis holds, $n$ is large, and certain
regularity conditions hold, the deviance is asymptotically chi-squared, with $p_0$
degrees of freedom. Deviance is also called the “Neyman-Pearson statistic”
or the “Wilks statistic.” Deviance is the analog of $F$ (section 5.7), although
the scaling is a little different. Details are beyond our scope.
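
Here is a toy illustration of the recipe, not an example from the text. Take n independent 0-1 responses with common success probability, parameterized by the logit $\theta$, and let the null hypothesis be $\theta = 0$ (success probability 1/2), so one parameter is constrained. The data values (60 successes in 100 trials) and the function name are made up for the example.

```python
import math

def deviance(k, n):
    """Deviance 2(M - M0) for k successes in n Bernoulli trials (0 < k < n).
    Full model: success probability free, maximized at k/n.
    Null model: logit parameter 0, i.e. success probability 1/2."""
    p_hat = k / n
    M = k * math.log(p_hat) + (n - k) * math.log(1.0 - p_hat)
    M0 = n * math.log(0.5)
    return 2.0 * (M - M0)

d = deviance(60, 100)
print(d)            # compare with 3.84, the 5% point of chi-squared with 1 df
print(d > 3.84)
```

With these made-up data the deviance is a little over 4, just beyond the 5% critical value for one degree of freedom.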


The score test. In many applications, the score test will be more robust.
The score test uses the statistic

$$\frac{1}{n}\, L'(\hat\theta_0)\, I_{\hat\theta_0}^{-1}\, L'(\hat\theta_0),$$

where $\hat\theta_0$ is the MLE in the constrained model, and $L'$ is the partial derivative
of the log likelihood function: $L'$ is viewed as a row vector on the left
and a column vector on the right. The asymptotic distribution under the
null hypothesis is still chi-squared with $p_0$ degrees of freedom. Rao (1973,
pp. 415-20) discusses the various likelihood tests.


The information matrix. Suppose the $X_i$ are IID with density $f_\theta$. The
$jk$th entry in the Fisher information matrix is

$$\frac{1}{n}\sum_{i=1}^{n} \frac{\partial f_\theta(X_i)}{\partial\theta_j}\, \frac{\partial f_\theta(X_i)}{\partial\theta_k}\, \frac{1}{f_\theta(X_i)^2},$$

which can be estimated by putting $\theta = \hat\theta$, the MLE. In some circumstances,
this is easier to compute than observed information, and more stable. With
endpoint maxima, neither method is likely to work very well.


Identifiability. A constant function $f(\theta)$ is identifiable for trivial (and
irritating) reasons: there are no $\theta_1$, $\theta_2$ with $f(\theta_1) \ne f(\theta_2)$. Although many
texts blur the distinction between identifiability and estimability, it seemed
better to separate them. The flaw in the terminology is this. A parameter may
not be estimable (no estimator for it is exactly unbiased) but there could still
exist a very accurate estimator (small bias, small variance).


A technical side issue. According to our definition, $f(\theta)$ is identifiable if
$f(\theta_1) \ne f(\theta_2)$ implies $P_{\theta_1} \ne P_{\theta_2}$. The informal discussion may correspond
better to a slightly stronger definition: there should exist a function $\phi$ with
$\phi(P_\theta) = f(\theta)$; measurability conditions are elided.

The bigger picture. Many statisticians frown on under-identified mod- 
els: if a parameter is not identifiable, two or more values are indistinguishable, 
no matter how much data you have. On the other hand, most applied problems
are under-identified. Identification is achieved only by imposing somewhat 
arbitrary assumptions (independence, constant coefficients, etc.). That is one 
of the central tensions in the field. Efforts have been made to model this 
tension as a bias-variance tradeoff. Truncating the number of parameters 
introduces bias but reduces variance, and the optimal truncation can be con- 
sidered. Generally, however, the analysis takes place in a context that is 
already highly stylized. For discussion, see Evans and Stark (2002). 


Evans and Schwab. The focus is on tables 1-3 and table 6 in the paper. 
In table 6, we consider only the likelihood estimates for line (2); line (1) re- 
peats estimates from the single-equation model. Data from High School and 
Beyond (HS&B) are available, under stringent confidentiality agreements, 
as part of NELS—the National Educational Longitudinal Surveys. The ba- 
sic books on HS&B are Coleman et al (1982), Coleman and Hoffer (1987). 
Twenty years later, these books are still worth reading: the authors had real 
insight into the school system, and the data analysis is quite interesting. Cole- 
man and Hoffer (1987) include several chapters on graduation rates, admis- 
sion to college, success in college, and success in the labor force, although 
Evans and Schwab pay little attention to these data. 

The total sample sizes for students in Catholic and public schools in ta- 
ble 1 of Evans and Schwab appear to have been interchanged. There may be 
other data issues too. See table 2.1 in Coleman and Hoffer (1987), which re- 
ports noticeably higher percentages of students with incomes above $38,000. 
Moreover, table 2 in Evans and Schwab should be compared with Coleman 
and Hoffer (1987, table 5.3): graduation rates appear to be inconsistent. 

Table 1.1 in Coleman et al (1982) shows a realized sample in 1980 of 
26,448 students in public schools, and 2831 in Catholic schools. Evans and 
Schwab have 10,767 in public schools, and 2527 in Catholic schools (after 
fixing their table 1 in the obvious way). The difference in sample size for the 
Catholic schools probably reflects sample attrition from 1980 to 1984, but 
the difference for public schools seems too large to be explained that way. 
Some information on dropout rates can be gleaned from US Department of 
Education (1987). Compare also table 1.1 in Coleman et al (1982) with 
table 2.9 in Coleman and Hoffer (1987). 

Even without the exclusion restriction, the bivariate probit model in 
section 4 may be identifiable; however, estimates are likely to be unstable. 
See Altonji et al (2005), who focus on the exogeneity assumptions in the 
model. Also see Briggs (2004), Freedman and Sekhon (2008). 


The discussion questions. Powers and Rock are using a version of Heckman’s
(1976, 1978, 1979) model, as are Evans and Schwab. The model is
discussed with unusual care by Briggs (2004). Many experiments have been
analyzed with logits and probits, for example, Pate and Hamilton (1992).
In question 7, the model has been simplified a little. The Nurses’ Health
Study used a Cox model with additional covariates and body mass index
(weight/height$^2$) rather than height and weight. The 6224 refers to women on
combined estrogen and progestin; the 27,034 are never-users. See Grodstein 
et al (1996). The experimental evidence shows the observational studies to 
have been quite misleading: Writing Group for the Women’s Health Initiative 
Investigators (2002), Petitti (1998, 2002), Freedman (2008b). 

Question 10 outlines the most basic of the response schedule models. A 
subject has a potential response at each level of treatment (T or C). One of 
these is observed, the other not. It is often thought that models are justified 
by randomization: but see question 11. Question 12 points to a weakness 
in response-schedule models: if a subject’s response depends on treatments 
given to other subjects, the model does not apply. This is relevant to studies 
of school effects. Question 18 looks at the “baseline model” in Garrett (1998, 
table 5.3); some complications in the data analysis have been ignored. 


Quadrature. If $f$ is a smooth function on the unit interval $[0,1]$, we
can approximate $\int_0^1 f(x)\,dx$ by $\frac{1}{n}\sum_{j=0}^{n-1} f(j/n)$. This method approximates $f$
by a step function with horizontal steps; the integral is approximated by the
sum of the areas of rectangular blocks. The “trapezoid rule” approximates $f$
on the interval $[j/n, (j+1)/n]$ by a line segment joining the point $(j/n, f(j/n))$
to $((j+1)/n, f((j+1)/n))$. The integral is approximated by the sum of trapezoidal areas.
This is better, as the diagram illustrates. There are many variations (Simpson’s
rule, Newton-Cotes methods, etc.).
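
The two approximations just described can be compared numerically. The sketch below is only an illustration: the function names `rectangle_rule` and `trapezoid_rule` and the test integrand $e^x$ are assumptions made for the example, not part of the text.

```python
import math

def rectangle_rule(f, n):
    """Approximate the integral of f over [0, 1] by (1/n) * sum of f(j/n), j = 0..n-1.

    This treats f as a step function with n horizontal steps.
    """
    return sum(f(j / n) for j in range(n)) / n

def trapezoid_rule(f, n):
    """Approximate the integral of f over [0, 1] by n trapezoids of width 1/n.

    On each interval [j/n, (j+1)/n], f is replaced by the line segment
    joining its values at the two endpoints.
    """
    interior = sum(f(j / n) for j in range(1, n))
    return (0.5 * f(0.0) + interior + 0.5 * f(1.0)) / n

# Hypothetical test case: the integral of exp(x) over [0, 1] is e - 1.
exact = math.e - 1.0
rect_error = abs(rectangle_rule(math.exp, 100) - exact)
trap_error = abs(trapezoid_rule(math.exp, 100) - exact)
print("rectangle error:", rect_error)
print("trapezoid error:", trap_error)
```

For a smooth integrand like this one, the trapezoid error shrinks roughly like $1/n^2$, while the rectangle error shrinks only like $1/n$.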


[Diagram: rectangle-rule and trapezoid-rule approximations to the area under f.]


Other numerical methods. Suppose $f$ is a smooth function on the line,
and we want to find $x$ near $x_0$ with $f(x) = 0$. “Newton’s method,” also called
the “Newton-Raphson method,” is simple, and often works. If $f(x_0) = 0$,
stop. Otherwise, approximate $f$ by the linear function $f_0(x) = a + b(x - x_0)$,
where $a = f(x_0)$ and $b = f'(x_0)$. Solve the linear equation $f_0 = 0$ to get
a new starting point. Iterate. There are many variations on this idea. If you
want to read more about numerical methods, try:


Acton FS (1997). Numerical Methods That Work. Mathematical Association
of America.

Atkinson K (2005). Elementary Numerical Analysis. Wiley, 3rd ed.

Epperson JF (2007). An Introduction to Numerical Methods and Analysis.
Wiley.

Lanczos C (1988). Applied Analysis. Dover Publications.

Strang G (1986). Introduction to Applied Mathematics. Wellesley-Cambridge.

Acton and Lanczos are classics, written for the mathematically inclined.
Atkinson is more like a conventional textbook; so is Epperson. Strang
is clear and concise, with a personal style, and might be the place to start.
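
Newton’s method, as outlined in the note on other numerical methods, fits in a few lines of code. The following is a minimal sketch under stated assumptions: the function name `newton`, the tolerance, the iteration cap, and the test equation $x^2 - 2 = 0$ are all choices made for the illustration.

```python
def newton(f, fprime, x0, tol=1e-12, max_iter=50):
    """Look for x near x0 with f(x) = 0 by Newton's method.

    At each step, f is replaced by the linear function a + b*(x - x_cur),
    with a = f(x_cur) and b = f'(x_cur); the root of that linear function
    becomes the new starting point.
    """
    x = x0
    for _ in range(max_iter):
        a = f(x)
        if abs(a) <= tol:          # close enough to a root: stop
            return x
        b = fprime(x)
        if b == 0.0:
            raise ZeroDivisionError("derivative is zero; Newton step undefined")
        x = x - a / b              # solves a + b*(x_new - x) = 0
    raise RuntimeError("Newton's method did not converge")

# Hypothetical example: solve x^2 - 2 = 0 starting from x0 = 1.
root = newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)
print(root)
```

The method “often works,” but not always: a poor starting point or a flat spot in $f$ can send the iterates far away, which is why the sketch caps the number of iterations.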


Logistic regression: the brief history. The logistic curve was originally
used to model population growth (Verhulst 1845, Yule 1925). If $p(t)$ is the
population at time $t$, Malthusian population theory suggested an equation of
the form

$$\frac{1}{p}\frac{dp}{dt} = a - bp.$$

The solution is

$$p(t) = \frac{a}{b}\,\Lambda(at + c),$$

where $\Lambda$ is the logistic distribution function. (The first thing to check is that
$\Lambda'/\Lambda = 1 - \Lambda$.) The linear function $a - bp$ on the right hand side of the
differential equation might be viewed by some as a first approximation to a
more realistic decreasing function.
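
The check suggested in parentheses can be written out as follows. This is a sketch of the calculation, with $\Lambda(x) = e^x/(1+e^x)$ taken as the logistic distribution function and $p(t) = (a/b)\Lambda(at + c)$ as the proposed solution.

```latex
\Lambda(x) = \frac{e^{x}}{1+e^{x}}, \qquad
\Lambda'(x) = \frac{e^{x}}{(1+e^{x})^{2}} = \Lambda(x)\,[1-\Lambda(x)],
\qquad\text{so}\qquad \frac{\Lambda'}{\Lambda} = 1-\Lambda .

p(t) = \frac{a}{b}\,\Lambda(at+c)
\;\Longrightarrow\;
\frac{1}{p}\frac{dp}{dt}
 = a\,\frac{\Lambda'(at+c)}{\Lambda(at+c)}
 = a\,[1-\Lambda(at+c)]
 = a - b\,p(t).
```

Since $\Lambda(x) \to 1$ as $x \to \infty$, the population in this model levels off at $a/b$; that is the source of the ceilings in the forecasts discussed next.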

In 1920, the population of the United States was 106 million, and models 
based on the logistic curve showed that the population would never exceed 
200 million (Pearl and Reed 1920, Hotelling 1927). As the US population 
increased beyond that limit, enthusiasm for the logistic growth law waned, 
although papers keep appearing on the topic. For reviews of population 
models, including the logistic, see Dorn (1950) and Hajnal (1955). Feller 
(1940) shows that normal and Cauchy distributions fit growth data as well as 
the logistic. 

An early biomedical application of logistic regression was Truett, Corn- 
field, and Kannel (1967). These authors fit a logistic regression to data from 
the Framingham study of coronary heart disease. The risk of death in the study 
period was related to a vector of covariates, including age, blood cholesterol 
level, systolic blood pressure, relative weight, blood hemoglobin level, smok- 
ing (at 3 levels), and abnormal electrocardiogram (a dummy variable). There 
were 2187 men and 2669 women, with 387 deaths and 271 subjects lost to 
followup (these were just censored). The analysis was stratified by sex and 
sometimes by age. 
The authors argue that the relationship must be logistic. Their model
seems to be like this, with death in the study period coded as $Y_i = 1$, survival
as $Y_i = 0$, and $X_i$ a row vector of covariates. Subjects are a random sample
from a population. Given $Y_i = 1$, the distribution of $X_i$ is multivariate
normal with mean $\mu_1$. Given $Y_i = 0$, the distribution is normal with the same
covariance matrix $G$ but a different mean $\mu_0$. Then $P(Y_i = 1 \mid X_i)$ would
indeed be logistic. This is easily verified, using Bayes’ rule and theorem 3.2.

The upshot of the calculation: $\mathrm{logit}\,P(Y_i = 1 \mid X_i) = \alpha + X_i\beta$, where
$\beta = G^{-1}(\mu_1 - \mu_0)'$ is the interesting parameter vector. The intercept is a
nuisance parameter, $\alpha = \mathrm{logit}\,P(Y_i = 1) + \frac{1}{2}(\mu_0 G^{-1}\mu_0' - \mu_1 G^{-1}\mu_1')$. If
$P(X_i \in dx \mid Y_i = 1) = C_\beta \exp(x\beta)\,P(X_i \in dx \mid Y_i = 0)$, conclusions are
similar; again, there will be a nuisance intercept.
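
The calculation behind the logistic form can be sketched as follows. The notation is an assumption made for this illustration: $f_1$ and $f_0$ are the two multivariate normal densities, $x$ and the means $\mu_1, \mu_0$ are row vectors, and $G$ is the common (symmetric) covariance matrix.

```latex
\mathrm{logit}\, P(Y_i = 1 \mid X_i = x)
  = \mathrm{logit}\, P(Y_i = 1) + \log\frac{f_1(x)}{f_0(x)}
  \qquad \text{(Bayes' rule)},

\log\frac{f_1(x)}{f_0(x)}
  = -\tfrac{1}{2}\,(x-\mu_1)\,G^{-1}(x-\mu_1)'
    + \tfrac{1}{2}\,(x-\mu_0)\,G^{-1}(x-\mu_0)'
  = x\,G^{-1}(\mu_1-\mu_0)'
    + \tfrac{1}{2}\bigl(\mu_0 G^{-1}\mu_0' - \mu_1 G^{-1}\mu_1'\bigr).
```

The quadratic terms $xG^{-1}x'$ cancel because the two groups share the covariance matrix $G$, so what remains is linear in $x$: the slope vector is $G^{-1}(\mu_1 - \mu_0)'$ and everything else is absorbed into the intercept. With unequal covariance matrices, the cancellation fails and the logit is quadratic in $x$.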

According to Truett, Cornfield, and Kannel, the distribution of $X_i$ has to
be multivariate normal, by the central limit theorem. But why is the central
limit theorem relevant? Indeed, the distribution of $X_i$ clearly wasn’t normal:
(i) there were dummy variables in $X_i$, and (ii) data on the critical linear
combinations are long-tailed. Furthermore, the subjects were a population,
not a random sample. Finally, why should we think that parameters are
invariant under interventions?


Regression and causation. Many statisticians find it surprising that re- 
gression and allied techniques are commonly used in the social and life sci- 
ences to infer causation from observational data, with qualitative inference 
perhaps more common than quantitative: X causes (or doesn’t cause) Y, the 
magnitude of the effect being of lesser interest. Eyebrows are sometimes 
raised about the whole idea of causation: 


“Beyond such discarded fundamentals as ‘matter’ and ‘force’ lies still 
another fetish amidst the inscrutable arcana of even modern science, 
namely, the category of cause and effect. Is this category anything but 
a conceptual limit to experience, and without any basis in perception 
beyond a statistical approximation?” (Pearson 1911, p. vi) 


8 


The Bootstrap 


8.1 Introduction 


The bootstrap is a powerful tool for approximating the bias and standard 
error of an estimator in a complex statistical model. However, results are 
dependable only if the sample is reasonably large. We begin with some 
toy examples where the bootstrap is not needed but the algorithm is easy to 
understand. Then we go on to applications that are more interesting. 


Example 1. The sample mean. Let $X_i$ be IID for $i = 1, \ldots, n$, with
mean $\mu$ and variance $\sigma^2$. We use the sample mean $\bar X$ to estimate $\mu$. Is
this estimator biased? What is its standard error? Of course, we know by
statistical theory that the estimator is unbiased. We know the SE is $\sigma/\sqrt{n}$.
And we know that $\sigma^2$ can be estimated by the sample variance,

$$\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar X)^2.$$

(With large samples, it is immaterial whether we divide by $n$ or $n - 1$.)
For the sake of argument, suppose we’ve forgotten the theory but remember
how to use the computer. What can we do to estimate the bias in
$\bar X$? to estimate the SE? Here comes the bootstrap idea at its simplest. Take
the data (the observed values of the $X_i$’s) as a little population. Simulate $n$
draws, made at random with replacement, from this little population. These
draws are a bootstrap sample. Figure 1 shows the procedure in box-model
format.


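Figure 1. Bootstrapping the sample mean.

[Box-model diagram: $n$ draws are made at random with replacement from a box holding the observed values $X_1, \ldots, X_n$, giving $X_1^*, \ldots, X_n^*$.]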


Let $X_1^*, \ldots, X_n^*$ be the bootstrap sample. Each $X_i$ will come into the
bootstrap sample some small random number of times, zero being a possible
number, and in random order. From the bootstrap sample, we could estimate
the average of the little population (the numbers in the box). The bootstrap
estimator is just the average of the bootstrap sample:

$$\bar X^* = \frac{1}{n}\sum_{i=1}^{n} X_i^*.$$


(Why estimate something that we know? Because that gives us a benchmark 
for the performance of the estimator. . . .) 

One bootstrap sample may not tell us very much, but we can draw many
bootstrap samples to get the sampling distribution of $\bar X^*$. Let’s index these
samples by $k$. There will be a lot of indices, so we’ll put parens around the
$k$. In this notation, the $k$th bootstrap estimator is $\bar X_{(k)}$: we don’t need both a
superscript $*$ and a subscript $(k)$. Suppose we have $N$ bootstrap replicates,
indexed by $k = 1, \ldots, N$:

$$\bar X_{(1)}, \ldots, \bar X_{(k)}, \ldots, \bar X_{(N)}.$$


Please keep separate: 

• $N$, the number of bootstrap replicates;

• $n$, the size of the real sample.
Usually, we can make N as large as we need, because computer time is cheap. 
Making n larger could be an expensive proposition. 
What about bias? On the computer, we’re resampling from the real
sample, whose mean is $\bar X$. According to our rules of the moment, we’re not
allowed to compute $E(\bar X_{(k)})$ using probability theory. But we can approximate
the expectation by

$$\bar X_{\rm ave} = \frac{1}{N}\sum_{k=1}^{N} \bar X_{(k)},$$

the mean of the $N$ bootstrap replicates. What we’ll see is

$$\bar X_{\rm ave} \approx \bar X.$$

In our simulation, the expected value of the sample mean is the population
mean. The bootstrap is telling us that the sample mean is unbiased.
Our next desire is the SE of the sample mean. Let

$$V = \frac{1}{N}\sum_{k=1}^{N} \bigl[\bar X_{(k)} - \bar X_{\rm ave}\bigr]^2.$$

This is the variance of the $N$ bootstrap replicates. The SD is $\sqrt{V}$, which tells
us how close a typical $\bar X_{(k)}$ is to $\bar X$. That’s what we’re looking for.


The bootstrap SE is the SD of the bootstrap replicates. 


The bootstrap SE says how good the original $\bar X$ was, as an estimate for $\mu$.

Why does this work? We’ve simulated $k = 1, \ldots, N$ replicates of $\bar X$, and
used the sample variance to approximate the real variance. The only problem 
is this. We should be drawing from the distribution that the real sample came 
from. Instead, we’re drawing from an approximation, namely, the empirical 
distribution of the sample {X1,..., Xn}. See figure 1. If n is reasonably 
large, this is a good approximation. If n is small, the approximation isn’t 
good, and the bootstrap is unlikely to work. 


Bootstrap principle for the sample mean. Provided that the sample
is reasonably large, the distribution of $\bar X^* - \bar X$ will be a good
approximation to the distribution of $\bar X - \mu$. In particular, the SD
of $\bar X^*$ will be a good approximation to the standard error of $\bar X$.
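
Example 1 can be tried out on the computer. The sketch below is an illustration only: the function name `bootstrap_mean`, the seeds, the number of replicates, and the simulated “real sample” are all assumptions made for the example.

```python
import random
import statistics

def bootstrap_mean(data, n_replicates=2000, seed=0):
    """Bootstrap the sample mean.

    Returns (x_bar, bias_estimate, bootstrap_se), where
    bias_estimate = X_ave - X_bar and bootstrap_se is the SD of the
    N bootstrap replicates (divisor N, matching V in the text).
    """
    rng = random.Random(seed)
    n = len(data)
    replicates = []
    for _ in range(n_replicates):
        # n draws at random with replacement from the little population
        sample = rng.choices(data, k=n)
        replicates.append(statistics.fmean(sample))
    x_bar = statistics.fmean(data)
    x_ave = statistics.fmean(replicates)
    boot_se = statistics.pstdev(replicates)
    return x_bar, x_ave - x_bar, boot_se

# Hypothetical "real sample": n = 100 draws from a normal distribution.
data_rng = random.Random(1)
data = [data_rng.gauss(10.0, 2.0) for _ in range(100)]

x_bar, bias_estimate, boot_se = bootstrap_mean(data)
plug_in_se = statistics.pstdev(data) / len(data) ** 0.5  # sigma-hat / sqrt(n)
print("sample mean:", x_bar)
print("estimated bias:", bias_estimate)
print("bootstrap SE:", boot_se, " theory-based SE:", plug_in_se)
```

The estimated bias should be negligible compared with the SE, and the bootstrap SE should be close to $\hat\sigma/\sqrt{n}$. Making `n_replicates` larger reduces the simulation noise, but does nothing about the error that comes from having only $n$ observations.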


On the computer, we imitated the sampling model for the data. We 
assumed the data come from IID random variables, so we simulated IID data 
on the computer—drawing at random with replacement from a box. This is 
important. Otherwise, the bootstrap is doing the wrong thing. As a technical
matter, we’ve been talking rather loosely about the bootstrap distribution of
$\bar X^* - \bar X$, but the distribution is conditional on the data $X_1, \ldots, X_n$.
The notation is a little strange, and so is the terminology. For instance,
$\bar X_{\rm ave}$ looks imposing, but it’s just something we use to check that the sample
mean is unbiased. The “bootstrap estimator” $\bar X^*$ is not a new estimator for
the parameter $\mu$. It’s something we generate on the computer to help us
understand the behavior of the estimator we started with: the sample mean.
The “empirical distribution of the sample” isn’t a distribution for the sample.
Instead, it’s an approximation to the distribution that we sampled from. The
approximation puts mass $1/n$ at each of the $n$ sample points. Lacking other
information, this is perhaps the best we can do.


Example 2. Regression. Suppose $Y = X\beta + \epsilon$, where the design
matrix $X$ is $n \times p$. Suppose that $X$ is fixed (not random) and has full rank.
The parameter vector $\beta$ is $p \times 1$, unknown, to be estimated by OLS. The errors
$\epsilon_1, \ldots, \epsilon_n$ are IID with mean 0 and variance $\sigma^2$, also unknown. What is the
bias in the OLS estimator $\hat\beta = (X'X)^{-1}X'Y$? What is the covariance matrix
of $\hat\beta$? The answers, of course, are 0 and $\sigma^2(X'X)^{-1}$; we would estimate $\sigma^2$
as the mean square of the residuals.

Again, suppose we’ve forgotten the formulas but have computer time on
our hands. We’ll use the bootstrap to get at bias and variance. We don’t want
to resample the $Y_i$’s, because they’re not IID: $E(Y_i) = X_i\beta$ differs from one $i$
to another, $X_i$ being the $i$th row of the design matrix $X$. The $\epsilon_i$ are IID, but
we can’t get our hands on them. A puzzle.

Suppose there’s an intercept in the model, so the first column of $X$ is
all 1’s. Then $\bar e = 0$, where $e = Y - X\hat\beta$ is the vector of residuals. We can
resample the residuals, and that’s the thing to do. The residuals $e_1, \ldots, e_n$
are a new little population, whose mean is 0. We draw $n$ times at random with
replacement from this population to get bootstrap errors $\epsilon_1^*, \ldots, \epsilon_n^*$. These
are IID and $E(\epsilon_i^*) = 0$. The $\epsilon_i^*$ behave like the $\epsilon_i$. Figure 2 summarizes the
procedure.


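Figure 2. Bootstrapping a regression model.

[Box-model diagram: $n$ draws are made at random with replacement from a box holding the residuals $e_1, \ldots, e_n$, giving the bootstrap errors $\epsilon_1^*, \ldots, \epsilon_n^*$.]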
The next step is to regenerate the $Y_i$’s:

$$Y^* = X\hat\beta + \epsilon^*.$$

Each $e_i$ comes into $\epsilon^*$ some small random number of times (zero is a possible
number) and in random order. So $e_1$ may get paired with $X_7$ and $X_{19}$. Or,
$e_1$ may not come into the sample at all. The design matrix $X$ doesn’t change,
because we assumed it was fixed. Notice that $Y^*$ follows the regression
model: errors are IID with expectation 0. We’ve imitated the original model
on the computer. There is a difference, though. On the computer, we know
the true parameter vector. It’s $\hat\beta$. We also know the true distribution of
the disturbances: IID draws from $\{e_1, \ldots, e_n\}$. So we can get our hands
on the distribution of $\hat\beta^* - \hat\beta$, where $\hat\beta^*$ is the bootstrap estimator
$\hat\beta^* = (X'X)^{-1}X'Y^*$.

Bootstrap principle for regression. With a reasonably large $n$, the
distribution of $\hat\beta^* - \hat\beta$ is a good approximation to the distribution
of $\hat\beta - \beta$. In particular, the empirical covariance matrix of the $\hat\beta^*$
is a good approximation to the theoretical covariance matrix of $\hat\beta$.


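What is an “empirical” covariance matrix? Suppose we generate $N$
bootstrap data sets, indexed by $k = 1, \ldots, N$. For each one, we would have
a bootstrap OLS estimator, $\hat\beta_{(k)}$. We have $N$ bootstrap replicates, indexed
by $k$:

$$\hat\beta_{(1)}, \ldots, \hat\beta_{(k)}, \ldots, \hat\beta_{(N)}.$$

The empirical covariance matrix is

$$\frac{1}{N}\sum_{k=1}^{N} \bigl[\hat\beta_{(k)} - \hat\beta_{\rm ave}\bigr]\bigl[\hat\beta_{(k)} - \hat\beta_{\rm ave}\bigr]', \quad\text{where}\quad \hat\beta_{\rm ave} = \frac{1}{N}\sum_{k=1}^{N} \hat\beta_{(k)}.$$

This is something you can work out. By way of comparison, the theoretical
covariance matrix depends on the unknown $\sigma^2$:

$$E\bigl\{[\hat\beta - E(\hat\beta)][\hat\beta - E(\hat\beta)]'\bigr\} = \sigma^2(X'X)^{-1}.$$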


What about bias? As shown in chapter 4, there is no bias: $E(\hat\beta) = \beta$.
In the simulation, $\hat\beta_{\rm ave} = \hat\beta$, apart from a little bit of random error. After all,
$\hat\beta$ (the estimated $\beta$ in the real data) is what we told the computer to take as
the true parameter vector. And $\hat\beta_{\rm ave}$ is the average of $N$ bootstrap replicates
$\hat\beta_{(k)}$, which is a good approximation to $E[\hat\beta_{(k)}]$.
On the computer, we imitated the sampling model for the data. By
assumption, the real data came from a regression model with fixed $X$ and
IID errors having mean 0. That is what we had to simulate on the computer:
otherwise, the bootstrap would have been doing the wrong thing.

We’ve been talking about the bootstrap distribution of $\hat\beta^* - \hat\beta$. This
is conditional on the data $Y_1, \ldots, Y_n$. After conditioning, we can treat the
residuals (which were computed from $Y_1, \ldots, Y_n$) as data rather than random
variables. The randomness in the bootstrap comes from resampling the
residuals. Again, the catch is this. We’d like to be drawing from the real
distribution of the $\epsilon_i$’s. Instead, we’re drawing from the empirical distribution
of the $e_i$’s. If $n$ is reasonably large and the design matrix is not too crazy, this
is a good approximation.
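
The procedure in example 2 can be sketched in code for the simplest case, a regression with an intercept and one explanatory variable, so that OLS has a closed form and no matrix library is needed. The function names, the seeds, and the simulated data are assumptions made for this illustration.

```python
import random
import statistics

def ols_line(x, y):
    """OLS fit of y = a + b*x + error; returns (a_hat, b_hat)."""
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b_hat = sxy / sxx
    return y_bar - b_hat * x_bar, b_hat

def residual_bootstrap(x, y, n_replicates=1000, seed=0):
    """Bootstrap the OLS slope by resampling residuals, with x held fixed.

    Returns (b_hat, bias_estimate, bootstrap_se) for the slope.
    """
    rng = random.Random(seed)
    a_hat, b_hat = ols_line(x, y)
    resid = [yi - a_hat - b_hat * xi for xi, yi in zip(x, y)]
    slopes = []
    for _ in range(n_replicates):
        # bootstrap errors: n draws with replacement from the residuals
        e_star = rng.choices(resid, k=len(resid))
        # regenerate the responses, treating (a_hat, b_hat) as the truth
        y_star = [a_hat + b_hat * xi + ei for xi, ei in zip(x, e_star)]
        slopes.append(ols_line(x, y_star)[1])
    bias = statistics.fmean(slopes) - b_hat
    return b_hat, bias, statistics.pstdev(slopes)

# Hypothetical data: y = 1 + 2x + noise, with a fixed design x = 0, 1, ..., 49.
data_rng = random.Random(2)
x = [float(i) for i in range(50)]
y = [1.0 + 2.0 * xi + data_rng.gauss(0.0, 1.0) for xi in x]

b_hat, bias, se = residual_bootstrap(x, y)
print("slope:", b_hat, " estimated bias:", bias, " bootstrap SE:", se)
```

Because there is an intercept, the residuals average to 0, and the bootstrap SE should agree closely with the usual formula based on the mean square of the residuals.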


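Example 3. Autoregression. There are parameters $a, b$. These are
unknown. Somehow, we know that $|b| < 1$. For $i = 1, 2, \ldots, n$, we have
$Y_i = a + bY_{i-1} + \epsilon_i$. Here, $Y_0$ is a fixed number. The $\epsilon_i$ are IID with mean 0
and variance $\sigma^2$, unknown. The equation has a lag term, $Y_{i-1}$: this is the $Y$
for the previous $i$. We’re going to estimate $a$ and $b$ by OLS, so let’s put this
into the format of a regression problem: $Y = X\beta + \epsilon$ with

$$Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}, \quad
X = \begin{pmatrix} 1 & Y_0 \\ 1 & Y_1 \\ \vdots & \vdots \\ 1 & Y_{n-1} \end{pmatrix}, \quad
\beta = \begin{pmatrix} a \\ b \end{pmatrix}, \quad
\epsilon = \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{pmatrix}.$$

The algebra works out fine: the $i$th row in the matrix equation $Y = X\beta + \epsilon$
gives us $Y_i = a + bY_{i-1} + \epsilon_i$, which is where we started. The OLS estimator
is $\hat\beta = (X'X)^{-1}X'Y$. We write $\hat a$ and $\hat b$ for the two components of $\hat\beta$.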

But something is fishy. There is a correlation between $X$ and $\epsilon$. Look at
the second column of $X$. It’s full of $\epsilon$’s, tucked away inside the $Y$’s. Maybe
we shouldn’t use $\hat\sigma^2(X'X)^{-1}$? And what about bias? Although the standard
theory doesn’t apply, the bootstrap works fine. We can use the bootstrap to
estimate variance and bias, in this non-standard situation where explanatory
variables are correlated with errors.

The bootstrap can be done following the pattern set by example 2, even
though the design matrix is random. You fit the model, getting $\hat\beta$ and residuals
$e = Y - X\hat\beta$. You freeze $Y_0$, as well as

$$\hat\beta = \begin{pmatrix} \hat a \\ \hat b \end{pmatrix}$$

and $e$. You resample the $e$’s to get bootstrap disturbance terms $\epsilon_1^*, \ldots, \epsilon_n^*$.
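The new point is that you have to generate the $Y_i^*$’s one at a time, using $\hat a$, $\hat b$,
and the $\epsilon_i^*$’s:

$$\begin{aligned}
Y_1^* &= \hat a + \hat b\,Y_0 + \epsilon_1^*, \\
Y_2^* &= \hat a + \hat b\,Y_1^* + \epsilon_2^*, \\
&\;\;\vdots
\end{aligned}$$

The first line is OK because $Y_0$ is a constant. The second line is OK because
when we need $Y_1^*$, we have it from the line before. And so forth. So, we have
a bootstrap data set:

$$Y^* = \begin{pmatrix} Y_1^* \\ Y_2^* \\ \vdots \\ Y_n^* \end{pmatrix}, \quad
X^* = \begin{pmatrix} 1 & Y_0 \\ 1 & Y_1^* \\ \vdots & \vdots \\ 1 & Y_{n-1}^* \end{pmatrix}, \quad
\epsilon^* = \begin{pmatrix} \epsilon_1^* \\ \epsilon_2^* \\ \vdots \\ \epsilon_n^* \end{pmatrix}.$$

Then we compute the bootstrap estimator, $\hat\beta^* = (X^{*\prime}X^*)^{-1}X^{*\prime}Y^*$. Notice
that we had to regenerate the design matrix because of the second column.
(That is why $X^*$ deserves its $*$.) The computer can repeat this procedure
many times, to get $N$ bootstrap replicates. The same residuals $e$ are used
throughout. But $\epsilon^*$ changes from one replicate to another. So do $X^*$, $Y^*$,
and $\hat\beta^*$.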


Bootstrap principle for autoregression. With a reasonably large $n$,
the distribution of $\hat\beta^* - \hat\beta$ is a good approximation to the distribution
of $\hat\beta - \beta$. In particular, the SD of $\hat b^*$ is a good approximation to the
standard error of $\hat b$. The average of $\hat b^* - \hat b$ is a good approximation
to the bias in $\hat b$.


In example 3, there will be some bias: the average of the $\hat b^*$’s will differ from
$\hat b$ by a significant amount. The lag terms (the $Y$’s from the earlier $i$’s) do
create some bias in the OLS estimator.
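
Example 3 can be sketched in code along the same lines. The key step is regenerating the $Y^*$’s one at a time, so the lagged column of the design matrix is rebuilt in every replicate. The function names, the parameter values ($a = 1$, $b = 0.5$, $Y_0 = 2$, $n = 50$), and the seeds are assumptions made for this illustration.

```python
import random
import statistics

def ols_line(x, y):
    """OLS fit of y = a + b*x + error; returns (a_hat, b_hat)."""
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b_hat = sxy / sxx
    return y_bar - b_hat * x_bar, b_hat

def simulate_ar1(a, b, y0, errors):
    """Generate Y_1, ..., Y_n one at a time from Y_i = a + b*Y_{i-1} + error_i."""
    ys, prev = [], y0
    for e in errors:
        prev = a + b * prev + e
        ys.append(prev)
    return ys

def fit_ar1(y0, ys):
    """Regress Y_i on a constant and the lagged value Y_{i-1}."""
    lagged = [y0] + ys[:-1]
    return ols_line(lagged, ys)

def bootstrap_ar1(y0, ys, n_replicates=1000, seed=0):
    """Returns (b_hat, bias_estimate, bootstrap_se) for the lag coefficient."""
    rng = random.Random(seed)
    a_hat, b_hat = fit_ar1(y0, ys)
    lagged = [y0] + ys[:-1]
    resid = [y - a_hat - b_hat * lag for y, lag in zip(ys, lagged)]
    b_stars = []
    for _ in range(n_replicates):
        e_star = rng.choices(resid, k=len(resid))         # resample residuals
        y_star = simulate_ar1(a_hat, b_hat, y0, e_star)   # y0 stays frozen
        b_stars.append(fit_ar1(y0, y_star)[1])            # lag column is rebuilt
    return b_hat, statistics.fmean(b_stars) - b_hat, statistics.pstdev(b_stars)

# Hypothetical data from an autoregression with a = 1, b = 0.5, Y_0 = 2.
data_rng = random.Random(3)
y0 = 2.0
ys = simulate_ar1(1.0, 0.5, y0, [data_rng.gauss(0.0, 1.0) for _ in range(50)])

b_hat, bias_estimate, boot_se = bootstrap_ar1(y0, ys)
print("b_hat:", b_hat, " estimated bias:", bias_estimate, " bootstrap SE:", boot_se)
```

In runs of this kind, the estimated bias in $\hat b$ is typically not negligible relative to the SE, in contrast to example 2, where the design matrix is fixed.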


Example 4. A model with pooled time-series and cross-sectional variation.
We combine example 3 above with example 2 in section 5.4. For
$t = 1, \ldots, m$ and $j = 1, 2$, we assume

$$Y_{t,j} = a_j + bY_{t-1,j} + cW_{t,j} + \epsilon_{t,j}.$$

Think of $t$ as time, and $j$ as an index for geographical areas. The $Y_{0,j}$ are
fixed, as are the $W$’s. The $a_1, a_2, b, c$ are scalar parameters, to be estimated
from the data, $W_{t,j}$, $Y_{t,j}$ for $t = 1, \ldots, m$ and $j = 1, 2$. (For each $t$ and
$j$, $W_{t,j}$ and $Y_{t,j}$ are scalars.) The pairs $(\epsilon_{t,1}, \epsilon_{t,2})$ are IID with mean 0 and
a positive definite $2 \times 2$ covariance matrix $K$. This too is unknown and to
be estimated. One-step GLS is used to estimate $a_1, a_2, b, c$, although the
GLS model (5.7) doesn’t hold, because of the lag term: see example 3. The
bootstrap will help us evaluate bias in feasible GLS, and the quality of the
plug-in estimators for SEs (section 5.3).

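We have to get the model into the matrix framework. Let $n = 2m$. For
$Y$, we just stack up the $Y_{t,j}$:

$$Y = \begin{pmatrix} Y_{1,1} \\ Y_{1,2} \\ Y_{2,1} \\ Y_{2,2} \\ \vdots \\ Y_{m,1} \\ Y_{m,2} \end{pmatrix}.$$

This is $n \times 1$. Ditto for the errors:

$$\epsilon = \begin{pmatrix} \epsilon_{1,1} \\ \epsilon_{1,2} \\ \epsilon_{2,1} \\ \epsilon_{2,2} \\ \vdots \\ \epsilon_{m,1} \\ \epsilon_{m,2} \end{pmatrix}.$$

For the design matrix, we’ll need a little trick, so let’s do $\beta$ next:

$$\beta = \begin{pmatrix} a_1 \\ a_2 \\ b \\ c \end{pmatrix}.$$

Now comes the design matrix itself: since $Y$ is $n \times 1$ and $\beta$ is $4 \times 1$, the
design matrix has to be $n \times 4$. The last column is the easiest: you just stack
the $W$’s. Column 3 is also pretty easy: stack the $Y$’s, with a lag. Columns 1
and 2 have the dummies for the two geographical areas. These have to be
organized so that $a_1$ goes with $Y_{t,1}$ and $a_2$ goes with $Y_{t,2}$:

$$X = \begin{pmatrix}
1 & 0 & Y_{0,1} & W_{1,1} \\
0 & 1 & Y_{0,2} & W_{1,2} \\
1 & 0 & Y_{1,1} & W_{2,1} \\
0 & 1 & Y_{1,2} & W_{2,2} \\
\vdots & \vdots & \vdots & \vdots \\
1 & 0 & Y_{m-1,1} & W_{m,1} \\
0 & 1 & Y_{m-1,2} & W_{m,2}
\end{pmatrix}.$$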
Let’s check it out. The matrix equation is $Y = X\beta + \epsilon$. The first line of
this equation says

$$Y_{1,1} = a_1 + bY_{0,1} + cW_{1,1} + \epsilon_{1,1}.$$

Just what we need. The next line is

$$Y_{1,2} = a_2 + bY_{0,2} + cW_{1,2} + \epsilon_{1,2}.$$

This is good. And then we get

$$Y_{2,1} = a_1 + bY_{1,1} + cW_{2,1} + \epsilon_{2,1},$$

$$Y_{2,2} = a_2 + bY_{1,2} + cW_{2,2} + \epsilon_{2,2}.$$

These are fine, and so are all the rest.

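Now, what about the covariance matrix for the errors? It’s pretty easy to
check that $\mathrm{cov}(\epsilon) = G$, where the $n \times n$ matrix $G$ has $K$ repeated along the
main diagonal:

$$(1) \qquad G = \begin{pmatrix}
K & 0_{2\times 2} & \cdots & 0_{2\times 2} \\
0_{2\times 2} & K & \cdots & 0_{2\times 2} \\
\vdots & \vdots & \ddots & \vdots \\
0_{2\times 2} & 0_{2\times 2} & \cdots & K
\end{pmatrix}.$$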

Before going on to bootstrap the model, let’s pause here to review one-step
GLS (sections 5.3-4). You make a first pass at the data, estimating $\beta$
by OLS. This gives $\hat\beta_{\rm OLS}$, with a residual vector $e = Y - X\hat\beta_{\rm OLS}$. We use $e$
to compute an estimate $\hat K$ for $K$. (We’ll also use the residuals for another
purpose, in the bootstrap.) Then we use $\hat K$ to estimate $G$. Notice that the
residuals naturally come in pairs. There is one pair for each time period,
because there are two geographical areas. Rather than a single subscript on $e$
it will be better to have two, $t$ and $j$, with $t = 1, \ldots, m$ for time and $j = 1, 2$
for geography. Let

$$e_{t,1} = e_{2t-1} \quad\text{and}\quad e_{t,2} = e_{2t}.$$

This notation makes the pairing explicit.
Now $\hat K$ is the empirical covariance matrix of the pairs:

$$(2) \qquad \hat K = \frac{1}{m}\sum_{t=1}^{m} \begin{pmatrix} e_{t,1} \\ e_{t,2} \end{pmatrix} \begin{pmatrix} e_{t,1} & e_{t,2} \end{pmatrix}.$$

Plug $\hat K$ into the formula (1) for $G$ to get $\hat G$, and then $\hat G$ into (5.10) to get

$$(3) \qquad \hat\beta_{\rm FGLS} = (X'\hat G^{-1}X)^{-1}X'\hat G^{-1}Y.$$


This is one-step GLS. The “F” in $\hat\beta_{\rm FGLS}$ is for “feasible.” Plug $\hat G$ into the
right hand side of (5.12) to get an estimated covariance matrix for $\hat\beta_{\rm FGLS}$,
namely,

$$(4) \qquad (X'\hat G^{-1}X)^{-1}.$$
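
The first two steps of one-step GLS, pairing the OLS residuals and forming the estimated covariance matrices, can be sketched as follows. The function names `pair_residuals`, `k_hat`, and `g_hat` and the four-element residual vector are hypothetical, chosen for this illustration; solving for the FGLS estimator itself would need a matrix library and is not shown.

```python
def pair_residuals(e):
    """Split a stacked residual vector (area 1, area 2, area 1, area 2, ...)
    into pairs, one pair per time period.

    With 0-based indexing, period t (t = 0, 1, ...) gets (e[2t], e[2t+1]).
    """
    if len(e) % 2 != 0:
        raise ValueError("expected an even number of residuals")
    return [(e[i], e[i + 1]) for i in range(0, len(e), 2)]

def k_hat(pairs):
    """Empirical 2x2 covariance matrix of the pairs: (1/m) * sum of outer products."""
    m = len(pairs)
    k11 = sum(u * u for u, _ in pairs) / m
    k12 = sum(u * v for u, v in pairs) / m
    k22 = sum(v * v for _, v in pairs) / m
    return [[k11, k12], [k12, k22]]

def g_hat(k, m):
    """Block-diagonal 2m x 2m matrix with the 2x2 matrix k repeated on the diagonal."""
    n = 2 * m
    g = [[0.0] * n for _ in range(n)]
    for t in range(m):
        for r in range(2):
            for c in range(2):
                g[2 * t + r][2 * t + c] = k[r][c]
    return g

# Hypothetical residual vector for m = 2 time periods and 2 areas.
e = [1.0, 2.0, -1.0, -2.0]
pairs = pair_residuals(e)
K = k_hat(pairs)
G = g_hat(K, len(pairs))
print(K)
for row in G:
    print(row)
```

No centering is done in `k_hat`: with an intercept for each area in the model, the OLS residuals for each area already sum to 0.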


Feasible GLS may be biased, especially with a lag term. And (4) is 
only an “asymptotic” formula: under some regularity conditions, it gives 
essentially the right answers with large samples. What happens with small 
samples? What about the sample size that we happen to have? And what 
about the bias? The bootstrap should give us a handle on these questions.

Resampling the $Y$’s is not a good idea: see example 2 for the reasoning.
Instead, we bootstrap the model following the pattern in examples 2 and 3.
We freeze $Y_{0,j}$ and the $W$’s, as well as

$$\hat\beta_{\rm FGLS} = \begin{pmatrix} \hat a_1 \\ \hat a_2 \\ \hat b \\ \hat c \end{pmatrix}$$

and the residuals $e$ from the OLS fit. To regenerate the data, we start by
resampling the $e$’s. As noted above, the residuals come in pairs. The pairing
has to be preserved in order to capture the covariance between $\epsilon_{t,1}$ and $\epsilon_{t,2}$.
Therefore, we resample pairs of residuals (figure 3).




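Figure 3. Bootstrapping a model with pooled time-series and cross-sectional
variation.

[Box-model diagram: the box holds the $m$ pairs of residuals $(e_{t,1}, e_{t,2})$; $m$ draws at random with replacement give the pairs of bootstrap errors $(\epsilon_{t,1}^*, \epsilon_{t,2}^*)$.]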


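More formally, we generate IID pairs $(\epsilon_{t,1}^*, \epsilon_{t,2}^*)$, choosing at random
with replacement from the paired residuals. The chance that $(\epsilon_{t,1}^*, \epsilon_{t,2}^*) =
(e_{7,1}, e_{7,2})$ is $1/m$. Ditto if 7 is replaced by 19. Or any other number. Since
we have $a_1$ and $a_2$ in the model, $\sum_{s=1}^{m} e_{s,1} = \sum_{s=1}^{m} e_{s,2} = 0$. (For the proof,
$e$ is orthogonal to the columns of $X$: the first two columns are the relevant
ones.) In other words, $E(\epsilon_{t,1}^*) = E(\epsilon_{t,2}^*) = 0$. We have to generate the
$Y_{t,j}^*$’s, as in example 3, one $t$ at a time, using $\hat a_1$, $\hat a_2$, $\hat b$, $\hat c$, and the $\epsilon_{t,j}^*$’s:

$$\begin{aligned}
Y_{1,1}^* &= \hat a_1 + \hat b\,Y_{0,1} + \hat c\,W_{1,1} + \epsilon_{1,1}^*, \\
Y_{1,2}^* &= \hat a_2 + \hat b\,Y_{0,2} + \hat c\,W_{1,2} + \epsilon_{1,2}^*, \\
Y_{2,1}^* &= \hat a_1 + \hat b\,Y_{1,1}^* + \hat c\,W_{2,1} + \epsilon_{2,1}^*, \\
Y_{2,2}^* &= \hat a_2 + \hat b\,Y_{1,2}^* + \hat c\,W_{2,2} + \epsilon_{2,2}^*,
\end{aligned}$$

and so forth. No need to regenerate $Y_{0,j}$ or the $W$’s: these are fixed. Now
we bootstrap the estimator, getting $\hat\beta_{\rm FGLS}^*$. This means doing OLS, getting
residuals, then $\hat K^*$ as in (2), then plugging $\hat K^*$ into (1) to get $\hat G^*$; finally,

$$(5) \qquad \hat\beta_{\rm FGLS}^* = \bigl(X^{*\prime}\hat G^{*-1}X^*\bigr)^{-1}X^{*\prime}\hat G^{*-1}Y^*.$$

We have to do this many times on the computer, to get some decent approximation
to the distribution of $\hat\beta_{\rm FGLS}^* - \hat\beta_{\rm FGLS}$. Notice the stars on the design matrix
in (5). When a bootstrap design matrix is generated on the computer, the
column with the $Y$’s changes every time.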
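
The resampling and regeneration steps for example 4 can be sketched as follows. This covers only the data-generating half of one bootstrap replicate; refitting by one-step GLS is omitted. The function names, the residual pairs, and the parameter values are all hypothetical, made up for this illustration.

```python
import random

def resample_pairs(pairs, rng):
    """Draw m pairs at random with replacement, keeping each pair intact."""
    return rng.choices(pairs, k=len(pairs))

def regenerate(a1, a2, b, c, y0, w, eps_star):
    """Regenerate the bootstrap responses one time period at a time.

    y0       = (Y_{0,1}, Y_{0,2}), held fixed
    w        = list of (W_{t,1}, W_{t,2}) for t = 1, ..., m, held fixed
    eps_star = list of resampled error pairs for t = 1, ..., m
    Returns the list of (Y*_{t,1}, Y*_{t,2}) for t = 1, ..., m.
    """
    prev1, prev2 = y0
    out = []
    for (w1, w2), (e1, e2) in zip(w, eps_star):
        prev1 = a1 + b * prev1 + c * w1 + e1   # area 1 uses its own lag
        prev2 = a2 + b * prev2 + c * w2 + e2   # area 2 uses its own lag
        out.append((prev1, prev2))
    return out

# Hypothetical residual pairs (each column sums to 0) and frozen estimates.
pairs = [(0.5, -0.2), (-1.0, 0.4), (0.3, 0.1), (0.2, -0.3)]
w = [(1.0, 2.0), (0.5, 1.5), (2.0, 0.0), (1.0, 1.0)]
rng = random.Random(4)

eps_star = resample_pairs(pairs, rng)
y_star = regenerate(a1=1.0, a2=2.0, b=0.5, c=1.0, y0=(0.0, 0.0), w=w, eps_star=eps_star)
print(eps_star)
print(y_star)
```

Resampling the two coordinates separately would destroy the dependence between areas within a time period; drawing whole pairs is what keeps the covariance structure $K$ in the simulation.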


Bootstrap principle for feasible GLS. With a reasonably large $n$,
the distribution of $\hat\beta_{\rm FGLS}^* - \hat\beta_{\rm FGLS}$ is a good approximation to the
distribution of $\hat\beta_{\rm FGLS} - \beta$. In particular, the empirical covariance
matrix of $\hat\beta_{\rm FGLS}^*$ is a good approximation to the theoretical covariance
matrix of $\hat\beta_{\rm FGLS}$. The average of $\hat\beta_{\rm FGLS}^* - \hat\beta_{\rm FGLS}$ is a good
approximation to the bias in $\hat\beta_{\rm FGLS}$.


More specifically, we would simulate $N$ data sets, indexed by $k =
1, \ldots, N$. Each data set would consist of a simulated design matrix $X_{(k)}$ and a
simulated response vector $Y_{(k)}$. For each data set, we would compute $\hat G_{(k)}$
and a bootstrap replicate of the one-step GLS estimator,

$$\hat\beta_{{\rm FGLS},(k)} = \bigl[X_{(k)}'\hat G_{(k)}^{-1}X_{(k)}\bigr]^{-1}X_{(k)}'\hat G_{(k)}^{-1}Y_{(k)}.$$

Some things don’t depend on $k$: for instance, $Y_{0,j}$ and $W_{t,j}$. We keep
$\hat\beta_{\rm FGLS}$, the one-step GLS estimate from the real data, fixed throughout, as
the true parameter vector in the simulation. We keep the error distribution
fixed too: the box in figure 3 stays the same through all the bootstrap
replications.

This is a complicated example, but it is in this sort of example that you 
might want to use the bootstrap. The standard theory doesn’t apply. There will 
be some bias, which can be detected by the bootstrap. There probably won’t 
be any useful finite-sample results, although there may be some asymptotic 
formula like (4). The bootstrap is also asymptotic, but it often gets there faster 
than the competition. The next section has a real example, with a model for 
energy demand. Work the exercises, in preparation for the example. 


Exercise set A


1. Let $X_1, \ldots, X_{50}$ be IID $N(\mu, \sigma^2)$. The sample mean is $\bar X$. True or false:
$\bar X$ is an unbiased estimate of $\mu$, but is likely to be off $\mu$ by something
like $\sigma/\sqrt{50}$, just due to random error.


2. Let $X_{i(k)}$ be IID $N(\mu, \sigma^2)$, for $i = 1, \ldots, 50$ and $k = 1, \ldots, 100$. Let

$$\bar X_{(k)} = \frac{1}{50}\sum_{i=1}^{50} X_{i(k)}, \qquad s_{(k)}^2 = \frac{1}{50}\sum_{i=1}^{50} \bigl[X_{i(k)} - \bar X_{(k)}\bigr]^2,$$

$$\bar X_{\rm ave} = \frac{1}{100}\sum_{k=1}^{100} \bar X_{(k)}, \qquad V = \frac{1}{100}\sum_{k=1}^{100} \bigl[\bar X_{(k)} - \bar X_{\rm ave}\bigr]^2.$$

True or false, and explain:

(a) $\{\bar X_{(k)} : k = 1, \ldots, 100\}$ is a sample of size 100 from $N(\mu, \sigma^2/50)$.

(b) $V$ is around $\sigma^2/50$.

(c) $|\bar X_{(k)} - \bar X_{\rm ave}| < 2\sqrt{V}$ for about 95 of the $k$’s.

(d) $\sqrt{V}$ is a good approximation to the SE of $\bar X$, where $\bar X$ was defined
in exercise 1.

(e) The sample SD of the $\bar X_{(k)}$’s is a good approximation to the SE
of $\bar X$.

(f) $\bar X_{\rm ave}$ is $N(\mu, \sigma^2/5000)$.


3. (This continues exercise 2.) Fill in the blanks, and explain.

(a) $\bar X_{\rm ave}$ is nearly $\mu$, but is off by something like ______. Options:
$\sigma$, $\sigma/\sqrt{50}$, $\sigma/\sqrt{100}$, $\sigma/\sqrt{5000}$

(b) $\bar X_{\rm ave}$ is nearly $\mu$, but is off by something like ______. Options:
$\sqrt{V}$, $\sqrt{V}/\sqrt{50}$, $\sqrt{V}/\sqrt{100}$, $\sqrt{V}/\sqrt{5000}$

(c) The SD of the $\bar X_{(k)}$’s is around ______. Options:
$\sqrt{V}$, $\sqrt{V}/\sqrt{50}$, $\sqrt{V}/\sqrt{100}$, $\sqrt{V}/\sqrt{5000}$


Exercises 1-3 illustrate the parametric bootstrap: we’re resampling from a 
given parametric distribution, the normal. The notation looks awkward, but 
will be handy later. 


8.2 Bootstrapping a model for energy demand 


In the 1970s, long before the days of the SUV, we had an energy crisis 
in the United States. An insatiable demand for Arab oil, coupled with an 
oligopoly, led to price controls and gas lines. The crisis generated another 
insatiable demand, for energy forecasts. The Department of Energy tried to 
handle both problems. This section will discuss RDFOR, the Department’s 
Regional Demand Forecasting model for energy demand. 

We consider only the industrial sector. (The other sectors are residential,
commercial, transportation.) The chief equation was this:

$$(6) \qquad Q_{t,j} = a_j + bC_{t,j} + cH_{t,j} + dP_{t,j} + eQ_{t-1,j} + fV_{t,j} + \delta_{t,j}.$$

Here, $t$ is time in years: $t = 1961, 1962, \ldots, 1978$. The index $j$ ranges over
geographical regions, 1 through 10. Maine is in region 1 and California in
region 10. On the left hand side, $Q_{t,j}$ is the log of energy consumption by
the industrial sector in year $t$ and region $j$.

On the right hand side of the equation, $Q$ appears again, lagged by a year:
$Q_{t-1,j}$. The coefficient $e$ of the lag term was of policy interest, because $e$
was thought to measure the speed with which the economy would respond to
energy shocks. Other terms can be defined as follows.


• $C_{t,j}$ is the log of cooling degree days in year t and region j. Every day
that the temperature is one degree above 65° is a cooling degree day:
energy must be supplied to cool the factories down. If we have 15 days
with a temperature of 72°, that makes 15 × (72 − 65) = 105 cooling
degree days. It’s conventional to choose 65° as the baseline temperature.
Temperatures are in Fahrenheit: this is the US Department of Energy.


• $H_{t,j}$ is the log of heating degree days in year t and region j. Every day
that the temperature is one degree below 65° is a heating degree day:
energy must be supplied to heat the factories up. If we have 15 days with
a temperature of 54°, that makes 15 × (65 − 54) = 165 heating degree
days.


• $P_{t,j}$ is the log of the energy price for the industrial sector in year t and
region j.

• $V_{t,j}$ is the log of value added in the industrial sector in year t and region
j. “Value added” means receipts from sales less costs of production; the
latter include capital, labor, and materials. (This is a quick sketch of a
complicated national-accounts concept.)


• There are 10 region-specific intercepts, $a_j$. There are 5 coefficients
(b, c, d, e, f) that are constant across regions, making 10 + 5 = 15
parameters so far. Watch it: e is a parameter here, not a residual vector.


• δ is an error term. The $(\delta_{t,j} : j = 1, \ldots, 10)$ are IID 10-vectors for
t = 1961, ..., 1978, with mean 0 and a 10 × 10 covariance matrix K
that expresses inter-regional dependence.


• The δ’s are independent of all the right hand side variables, except the
lag term.
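The degree-day arithmetic in the definitions of C and H can be checked with a few lines of Python. This is only a sketch: the function names and the list-of-daily-temperatures input are illustrative choices, not part of RDFOR, and the model itself uses the log of the yearly totals.

```python
BASELINE_F = 65.0  # conventional baseline, degrees Fahrenheit


def cooling_degree_days(daily_temps, baseline=BASELINE_F):
    """Sum of degrees above the baseline, over all days."""
    return sum(max(temp - baseline, 0.0) for temp in daily_temps)


def heating_degree_days(daily_temps, baseline=BASELINE_F):
    """Sum of degrees below the baseline, over all days."""
    return sum(max(baseline - temp, 0.0) for temp in daily_temps)


cdd = cooling_degree_days([72.0] * 15)  # 15 days at 72 degrees
hdd = heating_degree_days([54.0] * 15)  # 15 days at 54 degrees
print(cdd, hdd)
```

The two calls reproduce the worked examples in the text: 15 days at 72° and 15 days at 54°.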


Are the assumptions sensible? For now, don’t ask, don’t tell: it won’t matter 
in the rest of this section. (The end notes comment on assumptions.) 

The model is like example 4, with 18 years of data and 10 regions rather 
than 2. Analysts at the Department of Energy estimated the model by feasible 
GLS, equation (3). Results are shown in column A of table 1. For instance, 
the lag coefficient e is estimated as 0.684. Furthermore, standard errors 
were computed by the “plug-in” method, equation (4). Results are shown 
in column B. The standard error on the 0.684 is 0.025. The quality of these 
plug-in standard errors is an issue. Bias is also an issue, for two reasons. 
(i) There is a lag term. (ii) The covariance matrix of the errors has to be 
estimated from the data. 


Feasible GLS is working hard in this example. Besides the 10 intercepts 
and 5 slopes, there is a 10 x 10 covariance matrix that has to be estimated 
from the data. The matrix has 10 variances on the diagonal and 45 covariances 
above the diagonal. We only have 18 years of data on 10 regions—at best, 
180 data points. The bootstrap will show there is bias in feasible GLS. It will 
also show that the plug-in SEs are seriously in error. 

We bootstrap the model just as in the previous section. This involves 
generating 100 simulated data sets on the computer. We tell the computer to 
take $\hat\beta_{\mathrm{FGLS}}$, column A, as ground truth for the parameters. (This is a truth
about the computer code, not a truth about the economy.) What do we use for 
the errors? Answer: we resample the residuals from the OLS fit. This is like 
example 4, with 18 giant tickets in the box, each ticket being a 10-vector of 
residuals. For instance, 1961 contributes a 10-vector with a component for 
each region. So does 1962, and so forth, up to 1978. 

Table 1. Bootstrapping RDFOR.

            One-step GLS         Bootstrap
            (A)        (B)       (C)       (D)      (E)       (F)
                                                    RMS       RMS
                       Plug-in                      plug-in   bootstrap
            Estimate   SE        Mean      SD       SE        SE

a1          −.95       .31       −.94      .54      .19       .43
a2          −1.00      .31       −.99      .55      .19       .43
a3          −.97       .31       −.95      .55      .19       .43
a4          −.92       .30       −.90      .53      .18       .41
a5          −.98       .32       −.96      .55      .19       .44
a6          −.88       .30       −.87      .53      .18       .41
a7          −.95       .32       −.94      .55      .19       .44
a8          −.97       .32       −.96      .55      .19       .44
a9          −.89       .29       −.87      .51      .18       .40
a10         −.96       .31       −.94      .54      .19       .42
cdd    b    .022       .013      .021      .025     .0084     .020
hdd    c    .10        .031      .099      .052     .019      .043
price  d    −.056      .019      −.050     .028     .011      .022
lag    e    .684       .025      .647      .042     .017      .034
va     f    .281       .021      .310      .039     .014      .029


When we resample, each ticket comes out a small random number of
times (perhaps zero). The tickets come out in random order too. For example,
the 1961 ticket might get used to simulate 1964 and again to simulate
1973; the 1962 ticket might not get used at all. What about the explanatory
variables on the right hand side of (6), like cooling degree days? We just leave
them as we found them; they were assumed exogenous. Similarly, we leave
$Q_{1960,j}$ alone. The lag terms for t = 1961, 1962, ... have to be regenerated
as we go.
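The resampling scheme just described can be sketched in Python with NumPy. This is an illustration under stated assumptions, not the original Department of Energy or bootstrap code: the function name `simulate_rdfor`, its argument layout, and the use of (T × J) arrays are all hypothetical. The sketch draws whole year-tickets with replacement, leaves the exogenous variables alone, and regenerates the lag term year by year.

```python
import numpy as np


def simulate_rdfor(a, slopes, C, H, P, V, q_start, resid, rng):
    """Generate one simulated (T x J) array of log quantities from equation (6).

    a       : (J,) regional intercepts, taken as ground truth
    slopes  : (b, c, d, e, f), taken as ground truth
    C,H,P,V : (T, J) exogenous variables, left as found
    q_start : (J,) log quantities for the year before the first year (1960)
    resid   : (T, J) residuals; row t is the "ticket" for year t
    rng     : a numpy.random.Generator
    """
    b, c, d, e, f = slopes
    T, J = C.shape
    # Draw T tickets with replacement. Each ticket is a whole J-vector,
    # so dependence between regions within a year is preserved.
    tickets = rng.integers(0, T, size=T)
    errors = resid[tickets]
    Q = np.empty((T, J))
    lag = np.asarray(q_start, dtype=float)
    for t in range(T):
        # The lag term comes from the simulated path, not from the real data.
        Q[t] = a + b * C[t] + c * H[t] + d * P[t] + e * lag + f * V[t] + errors[t]
        lag = Q[t]
    return Q
```

Calling the function 100 times with the same inputs (and a generator such as `np.random.default_rng(0)`) would give the 100 simulated data sets; only the response and the lag column change from one call to the next.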

For each simulated data set, we compute a one-step GLS estimate, $\hat\beta^*_{\mathrm{FGLS}}$.
This is a 15 × 1 vector (10 regional intercepts, 5 coefficients). The mean of
these vectors is shown in column C. For example, the coefficient of the lag
term is 14th in order, so $\hat e^*$ is the 14th entry in $\hat\beta^*_{\mathrm{FGLS}}$. The mean of the 100
$\hat e^*$’s is 0.647. The SD of the 100 bootstrap estimates is shown in column D.
For instance, the SD of the 100 $\hat e^*$’s is 0.042. The bootstrap has now delivered
its output, in columns C and D. We will use the output to analyze variance
and bias in feasible GLS. (Columns E and F will be discussed momentarily.)

Variance. The bootstrap SEs are just the SDs in column D. To review
the logic, the 100 $\hat e^*$’s are a sample from the true distribution (true within
the confines of the computer simulation). The mean of the sample is a good
estimate for the mean of the population, i.e., the expected value of $\hat e^*$. The
SD of the sample is a good estimate for the SD of $\hat e^*$. This tells you how far
the FGLS estimator is likely to get from its expected value. (If in doubt, go
back to the previous section.)

Plug-in SEs vs the bootstrap. Column B reports the plug-in SEs. Col- 
umn D reports the bootstrap SEs. Comparing columns B and D, you see that 
the plug-in method and the bootstrap are very different. The plug-in SEs are 
a lot smaller. But, maybe the plug-in method is right and the bootstrap is 
wrong? That is where column E comes in. Column E will show that the 
plug-in SEs are a lot too small. (Column E is special; the usual bootstrap 
stops with columns C and D.) 

For each simulated data set, we compute not only the one-step GLS es- 
timator but also the plug-in covariance matrix. The square root of the mean 
of the diagonal is shown in column E. Within the confines of the computer 
simulation—where the modeling assumptions are true by virtue of the com- 
puter code—column D gives the true SEs for one-step GLS, up to a little 
random error. Column E tells you what the plug-in method is doing, on average.
The plug-in SEs are too small, by a factor of 2 or 3. Estimating
all those covariances is making the data work too hard. That is what the
bootstrap has shown us.

Bias. As noted above, the mean of the 100 $\hat e^*$’s is 0.647. This is somewhat
lower than the assumed true value of 0.684 in column A. The difference
may look insignificant. Look again. We have a sample of size 100. The sample
average is 0.647. The sample SD is 0.042. The SE for the sample average
is 0.042/√100 = 0.0042. (This SE is special: it measures random error in
the simulation, which has “only” 100 replicates.) The difference is
0.684 − 0.647 = 0.037, nearly 9 times this SE. Bias is highly significant,
and larger in size than the plug-in SE: see column B. The bootstrap has shown
that FGLS is biased.
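The arithmetic behind the bias claim is short enough to script. The numbers below are read off table 1 for the lag coefficient; the variable names are just labels chosen for this sketch.

```python
import math

true_value = 0.684   # column A: ground truth in the simulation
boot_mean = 0.647    # column C: mean of the 100 replicates
boot_sd = 0.042      # column D: SD of the 100 replicates
plug_in_se = 0.025   # column B
n_reps = 100

# SE of the simulation average, measuring random error in the simulation.
se_of_mean = boot_sd / math.sqrt(n_reps)
bias = boot_mean - true_value
z = bias / se_of_mean
print(f"bias = {bias:.3f}, SE of mean = {se_of_mean:.4f}, z = {z:.1f}")
print("bias exceeds plug-in SE:", abs(bias) > plug_in_se)
```

The z-statistic compares the estimated bias with the simulation error only; it says nothing about whether the model fits the real data.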

Some details. The bootstrap is a bit complicated. Explicit notation may
make the story easier to follow. We’re going to have 100 simulated data sets.
Let’s index these by a subscript k = 1, ..., 100. We put parens around k
to distinguish it from other subscripts. Thus, $Q_{t,j,(k)}$ is the log quantity of
energy demand in year t and region j, in the kth simulated data set. The
response vector $Y_{(k)}$ in the kth data set is obtained by stacking up the $Q_{t,j,(k)}$.
First we have $Q_{1961,1,(k)}$, then $Q_{1961,2,(k)}$, and so on, down to $Q_{1961,10,(k)}$.
Next comes $Q_{1962,1,(k)}$, and so forth, all the way down to $Q_{1978,10,(k)}$. In
terms of a formula, $Q_{t,j,(k)}$ is the [10(t − 1961) + j]th entry in $Y_{(k)}$, for
t = 1961, 1962, ... and j = 1, ..., 10.
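The stacking rule is easy to get wrong by one, so here is a small check in Python. The function name and constants are made up for this sketch; positions are 1-based, as in the text.

```python
FIRST_YEAR, LAST_YEAR, N_REGIONS = 1961, 1978, 10


def stacked_position(t, j):
    """1-based position of the (year t, region j) entry in the stacked vector."""
    if not (FIRST_YEAR <= t <= LAST_YEAR and 1 <= j <= N_REGIONS):
        raise ValueError("year or region out of range")
    return N_REGIONS * (t - FIRST_YEAR) + j


print(stacked_position(1961, 1), stacked_position(1962, 1), stacked_position(1978, 10))
```

With 0-based arrays, as in NumPy, the same entry sits at index `stacked_position(t, j) - 1`.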

There’s no need to have a subscript (k) on the other variables, like cooling
degree days or value added: these don’t change. The design matrix in the kth
simulated data set is $X_{(k)}$. There are 10 columns for the regional dummies
(example 4 had two regional dummies), followed by one column each for
cooling degree days, heating degree days, price, lagged quantity, value added.
These are all stacked in the same order as $Y_{(k)}$. Most of the columns stay the
same throughout the simulation, but the column with the lags keeps changing.
That is why a subscript k is needed on the design matrix.

For the kth simulated data set, we compute the one-step GLS estimator
as


(7)    $\hat\beta_{\mathrm{FGLS},(k)} = [X_{(k)}' \hat G_{(k)}^{-1} X_{(k)}]^{-1} X_{(k)}' \hat G_{(k)}^{-1} Y_{(k)},$


where $\hat G_{(k)}$ is estimated from OLS residuals in a preliminary pass through the
kth simulated data set. Here is a little more detail. The formula for the OLS
residuals is


(8)    $Y_{(k)} - X_{(k)} [X_{(k)}' X_{(k)}]^{-1} X_{(k)}' Y_{(k)}.$


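The OLS residual $r_{t,j,(k)}$ for year t and region j is the [10(t − 1961) + j]th
entry in (8). (Why r? Because e is a parameter.) For each year from 1961
through 1978, we have a 10-vector of residuals, whose empirical covariance
matrix is


$$\hat K_{(k)} = \frac{1}{18} \sum_{t=1961}^{1978}
\begin{pmatrix} r_{t,1,(k)} \\ r_{t,2,(k)} \\ \vdots \\ r_{t,10,(k)} \end{pmatrix}
\begin{pmatrix} r_{t,1,(k)} & r_{t,2,(k)} & \cdots & r_{t,10,(k)} \end{pmatrix}.$$


If in doubt, look back at example 4. String 18 copies of $\hat K_{(k)}$ down the diagonal
of a 180 × 180 matrix to get the $\hat G_{(k)}$ in (7):


$$\hat G_{(k)} =
\begin{pmatrix}
\hat K_{(k)} & 0_{10\times 10} & \cdots & 0_{10\times 10} \\
0_{10\times 10} & \hat K_{(k)} & \cdots & 0_{10\times 10} \\
\vdots & \vdots & \ddots & \vdots \\
0_{10\times 10} & 0_{10\times 10} & \cdots & \hat K_{(k)}
\end{pmatrix}.$$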
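The recipe above (OLS pass, residual covariance, block-diagonal matrix, then GLS) can be sketched with NumPy. The function name `one_step_gls` and its arguments are assumptions made for this illustration; the data are assumed to be stacked year by year, as described earlier, and the estimated covariance matrix is assumed to be invertible.

```python
import numpy as np


def one_step_gls(X, Y, n_years, n_regions):
    """One-step (feasible) GLS for data stacked year by year.

    Returns the estimate, as in (7), and the plug-in covariance
    matrix [X' G^{-1} X]^{-1}.
    """
    # Preliminary pass: OLS fit and residuals, as in (8).
    beta_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = (Y - X @ beta_ols).reshape(n_years, n_regions)
    # Empirical covariance of the yearly residual vectors (divide by n_years).
    K_hat = resid.T @ resid / n_years
    # Block-diagonal G: one copy of K_hat for each year.
    G_hat = np.kron(np.eye(n_years), K_hat)
    G_inv_X = np.linalg.solve(G_hat, X)         # G^{-1} X
    plug_in_cov = np.linalg.inv(X.T @ G_inv_X)  # [X' G^{-1} X]^{-1}
    beta_fgls = plug_in_cov @ (G_inv_X.T @ Y)   # G is symmetric
    return beta_fgls, plug_in_cov
```

Building the full 180 × 180 matrix is wasteful but mirrors the formulas; for RDFOR-sized problems the cost is negligible. The square roots of the diagonal of `plug_in_cov` are the plug-in SEs.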


The kth replicate bootstrap estimator $\hat\beta_{\mathrm{FGLS},(k)}$ in (7) is a 15-vector, with
estimates for the 10 regional intercepts followed by $\hat b_{(k)}, \hat c_{(k)}, \hat d_{(k)}, \hat e_{(k)}, \hat f_{(k)}$.
The simulated estimate for the lag coefficient $\hat e_{(k)}$ is therefore the 14th entry
in $\hat\beta_{\mathrm{FGLS},(k)}$. The 0.647 under column C in the table was obtained as


$$\hat e_{\mathrm{ave}} = \frac{1}{100} \sum_{k=1}^{100} \hat e_{(k)}.$$


Up to a little random error, this is $E[\hat e_{(k)}]$, i.e., the expected value of the
one-step GLS estimator in the simulation. The 0.042 was obtained as


$$\sqrt{\frac{1}{100} \sum_{k=1}^{100} \bigl(\hat e_{(k)} - \hat e_{\mathrm{ave}}\bigr)^2}.$$


Up to another little random error, this is the SE of the one-step GLS estimator 
in the simulation. (Remember, e is a parameter not a residual vector.) 

For each simulated data set, we compute not only the one-step GLS 
estimator but also the plug-in covariance matrix 


(9)    $[X_{(k)}' \hat G_{(k)}^{-1} X_{(k)}]^{-1}.$


We take the mean over k of each of the 15 diagonal elements in (9). The 
square root of the means goes into column E. That column tells the truth 
about the plug-in SEs: they’re much too small. 

The squaring and unsquaring may be a little hard to follow, so let’s try a
general formula. We generate a sequence of variances on the computer. The
square root of each variance is an SE. Then


$\text{RMS SE} = \sqrt{\text{mean}(\text{SE}^2)} = \sqrt{\text{mean variance}}.$


Bootstrapping the bootstrap. Finally, what about the bootstrap? Does
it do any better than the asymptotics? It turns out we can calibrate the bootstrap
by doing an even larger simulation (column F). For each of our 100
simulated data sets $[X_{(k)}, Y_{(k)}]$, we compute the analog of column D. For this
purpose, each simulated data set spawns 100 simulated data sets of its own.
All in all, there are 100² = 10,000 data sets to keep track of, but with current
technology, not a problem. For each simulated data set, we get simulated
bootstrap SEs on each of the 15 parameter estimates. The RMS of the simulated
bootstrap SEs is shown in column F. The bootstrap runs out of gas too,
but it comes a lot closer to truth (column D) than the plug-in SEs (column E).
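The two-level structure may be easier to see on a toy problem. The sketch below is not RDFOR: it assumes the estimator is a plain sample mean and resamples the data directly, and every name in it is hypothetical. It shows only the bookkeeping, with 100 first-level data sets each spawning 100 of their own, and the RMS taken over the first level.

```python
import numpy as np


def bootstrap_se(data, n_reps, rng):
    """Bootstrap SE of the sample mean: SD of the replicate means."""
    n = len(data)
    draws = rng.choice(data, size=(n_reps, n), replace=True)
    return draws.mean(axis=1).std()


def rms_bootstrap_se(data, n_outer, n_inner, rng):
    """RMS of bootstrap SEs; each outer data set spawns n_inner of its own."""
    n = len(data)
    squared_ses = []
    for _ in range(n_outer):
        simulated = rng.choice(data, size=n, replace=True)  # first level
        squared_ses.append(bootstrap_se(simulated, n_inner, rng) ** 2)
    return float(np.sqrt(np.mean(squared_ses)))


rng = np.random.default_rng(0)
data = rng.normal(size=50)                               # stand-in for real data
first_level = bootstrap_se(data, 100, rng)               # analog of column D
second_level = rms_bootstrap_se(data, 100, 100, rng)     # analog of column F
print(first_level, second_level)
```

In this toy setting the two numbers should be close, because the bootstrap works well for a sample mean; the interest of column F in table 1 is that the gap is larger for feasible GLS.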

As noted before, usual applications of the bootstrap stop with columns 
C and D. Columns E and F are special. Column E uses the bootstrap to check 
on the plug-in SEs. Column F uses the bootstrap to check on itself. 

What is truth? For the simulation, column C gives expectations and 
D gives SEs (up to a little random error). For the real data, these are only 
approximations, because (i) the real world may not follow the model, and 
(ii) even if it did, we’re sampling from the empirical distribution of the resid- 
uals, not the theoretical distribution of the errors. If the model is wrong, the 
estimates in column A of table 1 and their SEs in column B are meaningless 
statistics. If the model is right, the estimates in column A are biased, and the 
SEs in column B are too small. This is an extrapolation from the computer 
model to the real world. 


Exercise set B 


1. There is a statistical model with a parameter θ. You need to estimate θ.
Which is a better description of the bootstrap? Explain briefly.
(i) The bootstrap will help you find an estimator for θ.
(ii) Given an estimator $\hat\theta$ for θ, the bootstrap will help you find the bias
and SE of $\hat\theta$.


2. Which terms in equation (6) are observable, and which are unobservable? 
Which are parameters? 

3. Does the model reflect the idea that energy consumption in 1975 might 
have been different from what it was? If so, how? 

4. In table 1, at the end of column A, you will find the number 0.281. How 
is this number related to equation (6)? 

5. To what extent are the one-step GLS estimates biased in this application? 
Which numbers in the table prove your point? How? 

6. Are plug-in SEs biased in this application? Which numbers in the table
prove your point? How?

7. Are bootstrap standard errors biased in this application? Which numbers
in the table prove your point? How?


8. Paula has observed values on four independent random variables with
common density $f_{\alpha,\beta}(x) = c(\alpha,\beta)(\alpha x - \beta)^4 \exp[-(\alpha x - \beta)^2]$, where
$\alpha > 0$, $-\infty < \beta < \infty$, and $c(\alpha,\beta)$ is chosen so that $\int_{-\infty}^{\infty} f_{\alpha,\beta}(x)\,dx = 1$.
She estimates α, β by maximum likelihood and computes the standard
errors from the observed information. Before doing the t-test to
see whether β is significantly different from 0, she consults a statistician,
who tells her to use the bootstrap because observed information is
only useful with large samples. What is your advice? (See discussion
question 7.15.)


9. (Hard.) In example 3, if $1 \le i < n$, show that $E(\epsilon_i|X) = \epsilon_i$.


8.3 End notes for chapter 8 


Terminology. In the olden days, boots had straps so you could pull them 
on. The term “bootstrap” comes from the expression, to lift yourself up by 
your own bootstraps. 


Theory. Freedman (1981, 1984) describes the theoretical basis for apply- 
ing the bootstrap to different kinds of regression models, with some asymp- 
totic results. 


Centering. In example 2, without an intercept, you would have to center 
the residuals. Likewise, in example 4, you need the two regional intercepts 
a1, a2. With RDFOR, it is the 10 regional intercepts that center the residuals. 
Without centering, the bootstrap may be way off (Freedman 1981). 


Which set of residuals? We could resample FGLS residuals. However, 
G in (4) is computed from the OLS residuals. A comparison between asymp- 
totics and the bootstrap seemed fairer if OLS residuals were resampled in the 
latter, so that is what we did. 


Autoregression. A regression of $Y_t$ on “lagged” values (e.g., $Y_{t-1}$) and
control variables is called an “autoregression,” with “auto” meaning self: Y
is explained in part by its own previous values. With the autoregression in
example 3, if |b| < 1 the conventional theory is a good approximation when
the sample size is large; however, if |b| ≥ 1, the theory gets more complicated
(Anderson 1959). Bias in coefficient estimates due to lags is a well-known
phenomenon (Hurwicz 1950). Bias in asymptotic standard errors is a less
familiar topic.

RDFOR. The big problem with the bootstrap is that the residuals are too
small. For OLS, there is an easy fix: divide by n − p, not n. In a complicated
model like RDFOR, what would you use for p? The right answer turns out
to depend on unknown parameters: feasible GLS isn’t real GLS. Using the
bootstrap to remove bias is tempting, but the reduction in bias is generally
offset by an increase in variance. Doss and Sethuraman (1989) have a theorem
which captures this idea.

Section 2 is based on Freedman and Peters (1984abc, 1985). Technically,
$P_{t,j}$ is a price index and $Q_{t,j}$ is a quantity index. (“Divisia” indices were
used in constructing the data.) Further simulation studies show the bias in
FGLS is mainly due to the presence of the lag term.

RDFOR, developed by the Department of Energy, is somewhat unre- 
alistic as a model for energy demand (Freedman-Rothenberg-Sutch 1983). 
Among other things, P and δ can scarcely be independent (chapter 9). However,
failures in the model do not explain bias in FGLS, or the poor behavior
of the plug-in SEs. Differences between columns A and C in table 1, or 
differences among columns D-E-F, cannot be due to specification error. The 
reason is this. In the computer simulation, the model holds true by virtue of 
the coding. 

In fact, the Department of Energy estimated the model using iteratively 
reweighted least squares (section 5.4) rather than one-step GLS. Iteration 
improves the performance of $\hat\beta$, but the bias in the estimated SEs gets worse.
In other examples, iteration degrades the performance of $\hat\beta$.


Plug-in SEs. These are more politely referred to as nominal or asymp- 
totic SEs: “nominal” contrasts with “actual,” and asymptotics work when the 
sample is large enough (see below). 


Other papers. The bias in the plug-in SEs for feasible GLS is redis- 
covered from time to time. See, e.g., Beck (2001) or Beck and Katz (1995). 
These authors recommend White’s method for estimating the SEs in OLS 
(end notes to chapter 5). However, “robust SEs” may have the same sort of 
problems as plug-in SEs, because estimated covariance matrices can be quite 
unstable. As a result, t-statistics will show unexpected behavior. Moreover, 
in the applications of interest, feasible GLS is likely to give more accurate 
estimates of the parameters than OLS. 


9 


Simultaneous Equations 


9.1 Introduction 


This chapter explains simultaneous-equation models, and how to esti- 
mate them using instrumental variables (or two-stage least squares). These 
techniques are needed to avoid simultaneity bias (aka endogeneity bias). The 
lead example will be hypothetical supply and demand equations for butter in 
the state of Wisconsin. The source of endogeneity bias will be explained, and 
so will methods for working around this problem. 

Then we discuss two real examples—(i) the way education and fertility 
influence each other, and (ii) the effect of school choice on social capital. 
These examples indicate how social scientists use two-stage least squares 
to handle (i) reciprocal causation and (ii) self-selection of subjects into the 
sample. (In the social sciences, two-stage least squares is often seen as the 
solution to problems of statistical inference.) At the end of the chapter there 
is a literature review, which puts modeling issues into a broader perspective. 

We turn now to butter. Supply and demand need some preliminary
discussion. For an economist, butter supply is not a single quantity but a
relationship between quantity and price. The supply curve shows the quantity
of butter that farmers would bring to market at different prices. In the left
hand panel of figure 1, price is on the horizontal axis and quantity on the
vertical. (Economists usually do it the other way around.)

Figure 1. Supply and demand. The vertical axis shows quantity;
the horizontal axis, price. [Three panels: Supply, Demand, Market Clearing.]

Notice that the supply curve slopes up. Other things being equal— 
ceteris paribus, as they say—if the price goes up so does the quantity offered 
for sale. Farmers will divert their efforts from making cheese or delivering 
milk to churning butter. If the price gets high enough, farmers will start 
buying suburbs and converting them back to pasture. As you can see from 
the figure, the curve is concave: each extra dollar brings in less butter than 
the dollar before it. (Suburban land is expensive land.) 

Demand is also a relationship between quantity and price. The demand 
curve in the middle panel of figure 1 shows the total amount of butter that 
consumers would buy at different prices. This curve slopes down. Other 
things being equal, as price goes up the quantity demanded goes down. This 
curve is convex—one expression of “the law of diminishing marginal utility.” 
(The second piece of cake is never as good as the first; if you will pay $10 
for the first piece, you might only pay $8 for the second, and so forth: that is 
convexity of P as a function of Q.) 

According to economic theory, the free market price is determined by the 
crossing point of the two curves. This “law of supply and demand” is illus- 
trated in the right hand panel of figure 1. At the free market price, the market 
clears: supply equals demand. If the price were set lower, the quantity de- 
manded would exceed the quantity supplied, and disappointed buyers would 
bid the price up. If the price were set higher, the quantity supplied would ex- 
ceed the quantity demanded, and frustrated suppliers would lower their prices. 
With price control, you just sell the butter to the government. That is why price
controls lead to butter mountains. With rent control, overt bidding is illegal;
there is excess demand for housing, as well as under-the-counter payments
of one kind or another. Relative to free markets, politicians set rents too low
and butter prices too high.

Supply curves and demand curves are response schedules (section 6.4). 
The supply curve shows the response of farmers to different prices. The 
demand curve shows the response of consumers. These curves are somewhat 
hypothetical, because at any given time, we only get to see one price and one 
quantity. The extent to which supply curves and demand curves exist, in the 
sense that (say) planetary orbits exist, may be debatable. For now, let us set 
such questions aside and proceed with the usual theory. 

Other things affect supply and demand besides price. Supply is affected 
by the costs of factors of production, e.g., the agricultural wage rate and 
the price of hay (labor and materials). These are “determinants of supply.” 
Demand is affected by prices for complements (things that go with butter, like 
bread) and substitutes (like olive oil). These are “determinants of demand.” 
The list could be extended. 

Suppose the supply curve is stable while the demand curve moves around 
(left hand panel, figure 2). Then the observations—the market clearing prices 
and quantities—would trace out the supply curve. Conversely, if the supply 
curve shifts while the demand curve remains stable, the observations would 
trace out the demand curve (middle panel). In reality, as economists see 
things, both curves are changing, so we get the right hand panel of figure 2. 
To estimate the curves, more assumptions must be introduced. Economists 
call this “specifying the model.” We need to specify the determinants of 
supply and demand, as well as the functional form of the curves. 


Figure 2. Tracing out supply and demand curves. The vertical axis
shows quantity; the horizontal axis, price. [Three panels. Left: demand curve
shifts, supply curve stable. Middle: supply curve shifts, demand curve stable.
Right: both curves moving.]

Our model has two “endogenous variables,” the quantity and price of
butter, denoted Q and P. The specification will say how these endogenous 
variables are determined by “exogenous variables.” The exogenous variables 
in our supply equation are the agricultural wage rate W and the price H of 
hay. These are the determinants of supply. The exogenous variables in the 
demand equation are the prices F of French bread and O of olive oil. These 
are the determinants of demand. For the moment, “exogeneity” just means 
“externally determined” and “endogeneity” means “determined within the 
model.” Technical definitions will come shortly. 

We consider a linear specification. The model has two linear equations 
in two unknowns, Q and P. For each time period t, 


(1a)    Supply    $Q = a_0 + a_1 P + a_2 W + a_3 H + \delta_t,$
(1b)    Demand    $Q = b_0 + b_1 P + b_2 F + b_3 O + \epsilon_t.$


On the right hand side, there are parameters, the a’s and b’s. There is price P. 
There are the determinants of supply in (1a) and the determinants of demand in 
(1b). There are random disturbance terms $\delta_t$ and $\epsilon_t$: otherwise, the data would
never fit the equations. Everything is linear and additive. (Linearity makes 
things simple; however, economists might transform the variables in order 
to get curves like those sketched in figures 1 and 2.) Notice the restrictions, 
which are sensible enough: W, H are excluded from the demand equation; 
F, O from the supply equation. 

To complete the specification, we need to make some assumptions about
$(\delta_t, \epsilon_t)$. Error terms have expectation 0. As pairs, $(\delta_t, \epsilon_t)$ are independent and
identically distributed for t = 1, ..., n, but $\delta_t$ is allowed to be correlated with
$\epsilon_t$. The variance of $\delta_t$ and the variance of $\epsilon_t$ may be different. Equation (1a)
is a linear supply schedule; (1b) is a linear demand schedule. We should
write $Q_{t,P,W,H,F,O}$ instead of Q (after all, these are response schedules),
but inconsistency seems a better choice.

Each equation describes a hypothetical experiment. In (1a), we set
P, W, H, F, O, and observe how much butter the farmers bring to market. 
By assumption, F and O have no effect on supply: they’re not in the equation. 
On the other hand, P, W, H should have additive linear effects. In (1b), we 
set P, W, H, F, O and observe how much butter the consumers will buy: 
W and H should have no effect on demand, while P, F, O should have 
additive linear effects. The disturbance terms are invariant under all interven- 
tions. So are the parameters, which remain the same for all combinations of 
W, H, F, O. 

There is a third hypothetical experiment, which could be described by
taking equations (1a) and (1b) together. The exogenous variables W, H, F, O
can be set to any particular values of interest, perhaps within certain ranges,
and the two equations solved together for the two unknowns Q and P, giving
us the quantity and price we would see in a free market, with the prescribed
values for the exogenous variables.

So far, we have three hypothetical experiments, where we can set the 
exogenous variables. In the social sciences, experiments are unusual. More 
often, equations are estimated using observational data. Another assumption 
is needed: that Nature runs experiments for us. 

Suppose, for instance, that we have 20 years of data in Wisconsin.
Economists would assume that Nature generated the data as if by choosing
$W_t, H_t, F_t, O_t$ for t = 1, ..., 20 from some joint distribution, independently
of the δ’s and ε’s. Thus, by assumption, $W_t, H_t, F_t, O_t$ are independent of
the error terms. This is “exogeneity” in its technical sense.

Nature substitutes her values for $W_t, H_t, F_t, O_t$ into the right hand side
of (1a) and (1b). She gets the supply and demand equations that are operative
in year t:


(2a)    Supply    $Q = a_0 + a_1 P + a_2 W_t + a_3 H_t + \delta_t,$
(2b)    Demand    $Q = b_0 + b_1 P + b_2 F_t + b_3 O_t + \epsilon_t.$


According to the model (here comes the law of supply and demand),
the market price $P_t$ and the quantity sold $Q_t$ in year t are determined as if by
solving (2a) and (2b) for the two unknowns Q and P:


(3a)    $Q_t = \dfrac{a_1 (b_0 + b_2 F_t + b_3 O_t + \epsilon_t) - b_1 (a_0 + a_2 W_t + a_3 H_t + \delta_t)}{a_1 - b_1},$


(3b)    $P_t = \dfrac{(b_0 + b_2 F_t + b_3 O_t + \epsilon_t) - (a_0 + a_2 W_t + a_3 H_t + \delta_t)}{a_1 - b_1}.$
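For readers who want the intermediate algebra, the solution of (2a) and (2b) can be written out as follows. The shorthand $A_t$ and $B_t$ is introduced only for this derivation and is not the book's notation; the steps assume $a_1 \neq b_1$, which holds if supply slopes up and demand slopes down.

```latex
\begin{align*}
A_t &= a_0 + a_2 W_t + a_3 H_t + \delta_t, \qquad
B_t = b_0 + b_2 F_t + b_3 O_t + \epsilon_t, \\
a_1 P_t + A_t &= b_1 P_t + B_t
  \quad\Longrightarrow\quad
  P_t = \frac{B_t - A_t}{a_1 - b_1}, \\
Q_t &= a_1 P_t + A_t
  = \frac{a_1 (B_t - A_t) + (a_1 - b_1) A_t}{a_1 - b_1}
  = \frac{a_1 B_t - b_1 A_t}{a_1 - b_1}.
\end{align*}
```

The second line equates the right hand sides of (2a) and (2b); the third substitutes the price back into the supply equation. Both disturbance terms appear in $P_t$ and in $Q_t$.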


We do not get to see the parameters or the disturbance terms. All we get
to see are $Q_t$, $P_t$, and the exogenous variables $W_t, H_t, F_t, O_t$. Our objective
is to estimate the parameters in (2a)-(2b), from these observational data. That
will tell us, for example, how farmers and consumers would respond to price
controls. The model allows us to make causal inferences from observational
data, if the underlying assumptions are right.

A regression of $Q_t$ on $P_t$ and the exogenous variables leads to simultaneity
bias, also called endogeneity bias, because there are disturbance terms
in the formula (3b) for $P_t$. Generally, $P_t$ will be correlated with $\delta_t$ and $\epsilon_t$.
In other words, $P_t$ is endogenous. That is the new statistical problem. Of
course, $Q_t$ is endogenous too: there are disturbance terms in (3a).

This section presented a simple econometric model with a supply equation
and a demand equation, equations (2a) and (2b). The source of endogeneity
bias was identified: disturbance terms turn up in formulas (3ab)
for $Q_t$ and $P_t$. (These “reduced form” equations are of no further interest
here, although they may be helpful in other contexts.) The way to get around
endogeneity bias is to estimate equations (2a) and (2b) by instrumental variables
rather than OLS. This new technique will be explained in sections 2
and 3. Section 7.4 discussed endogeneity bias in a different kind of model,
with a binary response variable.
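A small simulation can make the bias visible. Everything numerical below is hypothetical: the parameter values, the standard normal distributions, and the very large number of "years" are chosen only so that sampling noise is small. The code generates price and quantity from (3a) and (3b), then fits the supply equation by OLS.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000                              # many "years", to make the bias visible
a0, a1, a2, a3 = 1.0, 1.0, -1.0, -1.0   # supply parameters (hypothetical)
b0, b1, b2, b3 = 5.0, -1.0, -1.0, 1.0   # demand parameters (hypothetical)

W, H, F, O = rng.normal(size=(4, n))    # exogenous variables
delta, eps = rng.normal(size=(2, n))    # disturbances, independent of W, H, F, O

supply_part = a0 + a2 * W + a3 * H + delta
demand_part = b0 + b2 * F + b3 * O + eps
P = (demand_part - supply_part) / (a1 - b1)             # equation (3b)
Q = (a1 * demand_part - b1 * supply_part) / (a1 - b1)   # equation (3a)

# Naive fit of the supply equation: OLS regression of Q on P, W, H.
design = np.column_stack([np.ones(n), P, W, H])
ols, *_ = np.linalg.lstsq(design, Q, rcond=None)
print("true a1:", a1, " OLS estimate of a1:", round(float(ols[1]), 3))
```

With these particular values, the OLS coefficient on price settles near 0.5 rather than the true 1.0, because price is negatively correlated with the supply disturbance. More data does not help: the bias does not shrink as n grows.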


Exercise set A


1. In equation (1a), should $a_1$ be positive or negative? What about $a_2$, $a_3$?
2. In equation (1b), should $b_1$ be positive or negative? What about $b_2$, $b_3$?
3. In the butter model of this section: 
(a) Does the law of supply and demand hold true? 
(b) Is the supply curve concave? strictly concave? 
(c) Is the demand curve convex? strictly convex? 
(Economists prefer log linear specifications. . . .) 


4. An economist wants to use the butter model to determine how farmers 
will respond to price controls. Which of the following equations is the 
most relevant—(2a), (2b), (3a), (3b)? Explain briefly. 


9.2 Instrumental variables 


We begin with a slightly abstract linear model

(4)    $Y = X\beta + \delta,$


where Y is an observable n × 1 random vector, X is an observable n × p
random matrix, and β is an unobservable p × 1 parameter vector. The $\delta_i$ are
IID with mean 0 and finite variance $\sigma^2$; they are unobservable random errors.
This is the standard regression model, except that X is endogenous, i.e., X
and δ are dependent. Conditional on X, the OLS estimates are biased by
$(X'X)^{-1}X'E(\delta|X)$: see (4.9). This is simultaneity bias.

We can explain the bias another way. In the OLS model, we could have
obtained the estimator as follows: multiply both sides of (4) by $X'$, drop $X'\delta$
because it’s small (since $E(X'\delta) = 0$), and solve the resulting p equations for the
p unknown components of β. Here, however, $E(X'\delta) \neq 0$.

To handle simultaneity bias, economists and other social scientists would
estimate (4) using instrumental-variables regression, also called two-stage
least squares: the acronyms are IVLS and IISLS (or 2SLS, if you prefer
Arabic numerals). The method requires an n × q matrix of instrumental or
exogenous variables, with n > q ≥ p. The matrix will be denoted Z. The
matrices $Z'X$ and $Z'Z$ need to be of full rank, p and q respectively. If q > p,
the system is over-identified. If q = p, the system is just-identified. If
q < p, the case which is excluded by assuming q ≥ p, the system is
under-identified: parameters will not be identifiable (section 7.2). Let’s make a
cold list of the assumptions.
(i) X is n × p and Z is n × q with n > q ≥ p.

(ii) $Z'X$ and $Z'Z$ have full rank, p and q respectively.

(iii) $Y = X\beta + \delta$.

(iv) The $\delta_i$ are IID, with mean 0 and variance $\sigma^2$.

(v) Z is exogenous, i.e., Z is independent of δ.
Assumptions (i) and (ii) are easy to check from the data. The others are
substantially more mysterious.
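Checking (i) and (ii) from the data might look like the following NumPy sketch. The function name is made up for this illustration, and the rank is computed numerically, so nearly collinear columns can give a different answer depending on the tolerance used by `numpy.linalg.matrix_rank`.

```python
import numpy as np


def check_iv_dimensions(X, Z):
    """Check assumptions (i) and (ii): n > q >= p, Z'X and Z'Z of full rank."""
    n, p = X.shape
    n_z, q = Z.shape
    if n_z != n or not (n > q >= p):
        return False
    rank_zx = np.linalg.matrix_rank(Z.T @ X)
    rank_zz = np.linalg.matrix_rank(Z.T @ Z)
    return bool(rank_zx == p and rank_zz == q)
```

Nothing comparable exists for (iii)-(v): those are assumptions about unobservables, and no computation on X, Y, Z alone settles them.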

The idea behind IVLS is to multiply both sides of (4) by $Z'$, getting


(5)    $Z'Y = Z'X\beta + Z'\delta.$


This is a least squares problem. The response variable is $Z'Y$. The design
matrix is $Z'X$ and the error term is $Z'\delta$. The parameter vector is still β.
Econometricians use GLS (example 5.1, p. 65) to estimate (5), rather
than OLS. This is because $\mathrm{cov}(Z'\delta|Z) = \sigma^2 Z'Z \neq \sigma^2 I_{q\times q}$ (exercise 3C4).
Assumptions (i)-(ii) show that $Z'Z$ has an inverse; and the inverse has a square
root (exercise B1 below). We multiply both sides of (5) by $(Z'Z)^{-1/2}$ to get


(6)    $[(Z'Z)^{-1/2}Z'Y] = [(Z'Z)^{-1/2}Z'X]\beta + \eta$, where $\eta = (Z'Z)^{-1/2}Z'\delta$.


Apart from a little wrinkle to be discussed below, equation (6) is the 
usual regression model. As far as the errors are concerned, 


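(7) $E(\eta|Z) = 0$


because $Z$ was assumed exogenous: see (iv)-(v). (You want to condition on
$Z$ not $X$, because the latter is endogenous.) Moreover,


(8) $\mathrm{cov}(\eta|Z) = E\bigl[(Z'Z)^{-1/2} Z'\delta\delta'Z (Z'Z)^{-1/2} \,\big|\, Z\bigr]$
    $= (Z'Z)^{-1/2} Z' E[\delta\delta'|Z] Z (Z'Z)^{-1/2}$
    $= (Z'Z)^{-1/2} Z' \sigma^2 I_{n \times n} Z (Z'Z)^{-1/2}$
    $= \sigma^2 (Z'Z)^{-1/2} (Z'Z) (Z'Z)^{-1/2}$
    $= \sigma^2 I_{q \times q}.$
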
The big move is in the third line: $E[\delta\delta'|Z] = \sigma^2 I_{n \times n}$, because $Z$ was
assumed to be exogenous, and the $\delta_i$ were assumed to be IID with mean 0 and
variance $\sigma^2$: see (iv)-(v). Otherwise, we’re just factoring constants out of the
expectation and juggling matrices.
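The matrix juggling in the last two lines of (8) can be checked numerically. The sketch below builds a symmetric inverse square root from an eigendecomposition, which is one standard construction and not necessarily the one intended in exercise B1; the helper name `inv_sqrt` and the random matrix `Z` are assumptions made for illustration.

```python
import numpy as np

def inv_sqrt(A):
    """Symmetric inverse square root of a positive definite matrix A."""
    eigvals, eigvecs = np.linalg.eigh(A)      # A = R diag(eigvals) R'
    return eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T

rng = np.random.default_rng(2)
Z = rng.normal(size=(40, 3))                  # hypothetical instruments, q = 3
ZtZ = Z.T @ Z
S = inv_sqrt(ZtZ)                             # plays the role of (Z'Z)^{-1/2}
whitened = S @ ZtZ @ S                        # should be the q x q identity
print(np.round(whitened, 10))
```

Multiplying (5) by this matrix is what turns the covariance of the errors into a multiple of the identity, so that OLS on (6) is GLS on (5).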

The OLS estimate for $\beta$ in (6) is

(9) $\tilde\beta = (M'M)^{-1} M'L,$


where $M = (Z'Z)^{-1/2}Z'X$ is the design matrix and $L = (Z'Z)^{-1/2}Z'Y$ is
the response variable. (Exercise B1 shows that all the inverses exist.)
The IVLS estimator in the original system (4) is usually given as


(10) $\hat\beta_{\mathrm{IVLS}} = [X'Z(Z'Z)^{-1}Z'X]^{-1} X'Z(Z'Z)^{-1}Z'Y.$

We will show that $\hat\beta_{\mathrm{IVLS}} = \tilde\beta$, completing the derivation of the IVLS
estimator. This takes a bit of algebra. For starters, because $Z'Z$ is symmetric,

(11) $M'M = X'Z(Z'Z)^{-1/2}(Z'Z)^{-1/2}Z'X = X'Z(Z'Z)^{-1}Z'X,$

and

(12) $M'L = X'Z(Z'Z)^{-1/2}(Z'Z)^{-1/2}Z'Y = X'Z(Z'Z)^{-1}Z'Y.$


Substituting (11) and (12) into (9) proves that $\hat\beta_{\mathrm{IVLS}} = \tilde\beta$.
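The equality of the two estimators can also be confirmed on simulated data. In the sketch below, `ivls` codes formula (10) directly and `ols_on_transformed` runs OLS on the transformed system as in (9); the function names, the data-generating process, and the parameter values are hypothetical.

```python
import numpy as np

def inv_sqrt(A):
    """Symmetric inverse square root of a positive definite matrix."""
    eigvals, eigvecs = np.linalg.eigh(A)
    return eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T

def ivls(X, Z, Y):
    """Equation (10): [X'Z (Z'Z)^{-1} Z'X]^{-1} X'Z (Z'Z)^{-1} Z'Y."""
    ZtZ_inv = np.linalg.inv(Z.T @ Z)
    A = X.T @ Z @ ZtZ_inv @ Z.T @ X
    return np.linalg.solve(A, X.T @ Z @ ZtZ_inv @ Z.T @ Y)

def ols_on_transformed(X, Z, Y):
    """Equation (9): OLS of L on M, the response and design matrix in (6)."""
    S = inv_sqrt(Z.T @ Z)
    M = S @ Z.T @ X
    L = S @ Z.T @ Y
    return np.linalg.solve(M.T @ M, M.T @ L)

rng = np.random.default_rng(3)
n = 200
Z = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # q = 3
delta = rng.normal(size=n)
x = Z[:, 1] + 0.5 * Z[:, 2] + 0.8 * delta                    # endogenous regressor
X = np.column_stack([np.ones(n), x])                         # p = 2
Y = X @ np.array([1.0, 2.0]) + delta
beta_ivls = ivls(X, Z, Y)
beta_tilde = ols_on_transformed(X, Z, Y)
print(beta_ivls, beta_tilde)   # the two vectors agree up to rounding error
```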
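Standard errors are estimated using (13-14):


(13) $\widehat{\mathrm{cov}}(\hat\beta_{\mathrm{IVLS}}|Z) = \hat\sigma^2 [X'Z(Z'Z)^{-1}Z'X]^{-1},$

where

(14) $\hat\sigma^2 = \|Y - X\hat\beta_{\mathrm{IVLS}}\|^2 / (n - p).$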


Exercise C6 below provides an informal justification for definitions (13)- 
(14), and theorem 1 in section 8 has some rigor. It is conventional to divide 
by n — p in (14), but theorem 4.4 does not apply because we’re not in the 
OLS model: see the discussion of “the little wrinkle,” below. 
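Definitions (13)-(14) translate into a few lines of code. The following is a sketch on simulated data, with a hypothetical function name `ivls_with_se` and hypothetical parameter values (error variance 1, slope 2); it is not taken from the book’s computer labs.

```python
import numpy as np

def ivls_with_se(X, Z, Y):
    """IVLS estimate (10) with standard errors from (13)-(14)."""
    n, p = X.shape
    ZtZ_inv = np.linalg.inv(Z.T @ Z)
    A = X.T @ Z @ ZtZ_inv @ Z.T @ X              # X'Z (Z'Z)^{-1} Z'X, a p x p matrix
    beta_hat = np.linalg.solve(A, X.T @ Z @ ZtZ_inv @ Z.T @ Y)
    resid = Y - X @ beta_hat                     # residuals are computed with X itself
    sigma2_hat = resid @ resid / (n - p)         # equation (14)
    cov_hat = sigma2_hat * np.linalg.inv(A)      # equation (13)
    return beta_hat, np.sqrt(np.diag(cov_hat)), sigma2_hat

rng = np.random.default_rng(6)
n = 2000
Z = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
delta = rng.normal(size=n)                       # true error variance is 1
x = Z[:, 1] + 0.5 * Z[:, 2] + 0.8 * delta
X = np.column_stack([np.ones(n), x])
Y = X @ np.array([1.0, 2.0]) + delta
beta_hat, se, sigma2_hat = ivls_with_se(X, Z, Y)
print(beta_hat, se, sigma2_hat)
```

The standard errors are the square roots of the diagonal entries of the estimated covariance matrix, as with OLS.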

Equation (10) is pretty dense. For some people, it helps to check that all
the multiplications make sense. For instance, $Z$ is $n \times q$, so $Z'$ is $q \times n$. Then
$Z'Z$ and $(Z'Z)^{-1}$ are $q \times q$. Next, $X$ is $n \times p$, so $X'$ is $p \times n$. Thus, $X'Z$ is
$p \times q$ and $Z'X$ is $q \times p$, which makes $X'Z(Z'Z)^{-1}Z'X$ a $p \times p$ matrix. What
about $X'Z(Z'Z)^{-1}Z'Y$? Well, $X'Z$ is $p \times q$, $(Z'Z)^{-1}$ is $q \times q$, and $Z'Y$ is
$q \times 1$. So $X'Z(Z'Z)^{-1}Z'Y$ is $p \times 1$. This is pretty dense too, but there is a
simple bottom line: $\hat\beta_{\mathrm{IVLS}}$ is $p \times 1$, like it should be.
Identification. The matrix equation (5) unpacks to $q$ ordinary equations
in $p$ unknowns, the components of $\beta$. (i) If $q > p$, there usually won’t be any
vector $\beta$ that satisfies (5) exactly. GLS gives a compromise solution $\hat\beta_{\mathrm{IVLS}}$.
(ii) If $q = p$, there is a unique solution, which is $\hat\beta_{\mathrm{IVLS}}$: see exercise C5
below. (iii) If $q < p$, we don’t have enough equations relative to the number
of parameters that we are estimating. There will be many $\beta$’s satisfying (5).
That is the tipoff to under-identification.

The little wrinkle in (6). Given $Z$, the design matrix $M = (Z'Z)^{-1/2}Z'X$
is still related to the errors $\eta = (Z'Z)^{-1/2}Z'\delta$, because of the endogeneity of
$X$. This leads to small-sample bias. However, with luck, $M$ will be practically
constant, and a little bit of correlated randomness shouldn’t matter. Theorem 1
in section 8 will make these ideas more precise.


Exercise set B 

1. By assumptions (i)-(ii), $Z'X$ is $q \times p$ of rank $p$, and $Z'Z$ is $q \times q$ of rank
$q$. Show that:
(a) $Z'Z$ is positive definite and invertible; the inverse has a square root.
(b) $X'Z(Z'Z)^{-1}Z'X$ is positive definite, hence invertible. Hint. Suppose
$c$ is $p \times 1$. Can $c'X'Z(Z'Z)^{-1}Z'Xc \leq 0$?

Note. Without assumptions (i)-(ii), equations (10) and (13) wouldn’t
make sense.


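2. Let $U_i$ be IID random variables. Let $\bar U = \frac{1}{n}\sum_{i=1}^{n} U_i$. True or false,
and explain:


(a) $E(U_i)$ is the same for all $i$.

(b) $\mathrm{var}(U_i)$ is the same for all $i$.

(c) $E(U_i) = \bar U$.

(d) $\mathrm{var}(U_i) = \frac{1}{n}\sum_{i=1}^{n} (U_i - \bar U)^2$.

(e) $\mathrm{var}(U_i) = \frac{1}{n-1}\sum_{i=1}^{n} (U_i - \bar U)^2$.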


9.3 Estimating the butter model 


Our next project is to estimate the butter model using IVLS. We’ll start 
with the supply equation (2a). The equation is often written this way: 


(15) $Q_t = a_0 + a_1 P_t + a_2 W_t + a_3 H_t + \delta_t$ for $t = 1, \ldots, 20$.


The actual price and quantity in year $t$ are substituted for the free variables
$Q$ and $P$ that define the supply schedule. Reminder: according to the law
of supply and demand in the model, $Q_t$ and $P_t$ were obtained by solving the
pair of equations (2a)-(2b) for the two unknowns $Q$ and $P$.

Let’s get (15) into the format of (4). The response variable $Y$ is the
$20 \times 1$ column vector of $Q_t$’s, and $\delta$ is just the column of $\delta_t$’s. To get $\beta$, we
stack up $a_0, a_1, a_2, a_3$. The design matrix $X$ is $20 \times 4$. Column 1 is all 1’s,
to accommodate the intercept. Then we get a column of $P_t$’s, a column of
$W_t$’s, and a column of $H_t$’s. Column 1 is constant, and must be exogenous.
Columns 3 and 4 are exogenous by assumption. But column 2 is endogenous.
That’s the new problem.

To get the matrix Z of exogenous variables, we start with columns 1, 3, 
and 4 in X. But we need at least one more instrument, to make up for the 
column of prices. Where to look? The answer is, in the demand equation. 
Just add a column of $F_t$’s and a column of $O_t$’s. Both of these are exogenous,
by assumption. Now $q = 5$, and we’re good to go. The demand equation is
handled the same way: the extra instruments come from the supply equation. 
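The construction of the design matrix and the instrument matrix for the supply equation can be sketched in code. Everything numerical below is made up for illustration: the parameter vectors `a` and `b`, the error scale, and the standardized exogenous variables are hypothetical, and the demand schedule is taken to have the same linear form as the supply schedule, with bread and olive oil prices in place of wages and hay prices. Prices and quantities are generated by equating supply and demand and solving, as the model requires.

```python
import numpy as np

def simulate_butter(T, a, b, rng):
    """Simulate T years of the butter market from hypothetical supply (a) and demand (b) parameters."""
    W, H, F, O = rng.normal(size=(4, T))              # wages, hay, bread, olive oil
    delta, eps = rng.normal(scale=0.5, size=(2, T))   # supply and demand errors
    # Law of supply and demand: equate the two schedules, solve for P, then get Q.
    P = (b[0] - a[0] + b[2] * F + b[3] * O - a[2] * W - a[3] * H + eps - delta) / (a[1] - b[1])
    Q = a[0] + a[1] * P + a[2] * W + a[3] * H + delta
    return Q, P, W, H, F, O

def supply_matrices(Q, P, W, H, F, O):
    """Design matrix X, instrument matrix Z, and response Y for the supply equation."""
    ones = np.ones(len(Q))
    X = np.column_stack([ones, P, W, H])       # column 2 (prices) is endogenous
    Z = np.column_stack([ones, W, H, F, O])    # five exogenous columns
    return X, Z, Q

def ivls(X, Z, Y):
    """Equation (10)."""
    ZtZ_inv = np.linalg.inv(Z.T @ Z)
    A = X.T @ Z @ ZtZ_inv @ Z.T @ X
    return np.linalg.solve(A, X.T @ Z @ ZtZ_inv @ Z.T @ Y)

a = np.array([10.0, 2.0, -1.0, -1.0])   # hypothetical a0, a1, a2, a3
b = np.array([50.0, -3.0, 1.5, 1.0])    # hypothetical b0, b1, b2, b3
X, Z, Y = supply_matrices(*simulate_butter(20, a, b, np.random.default_rng(4)))
print(X.shape, Z.shape)
print(ivls(X, Z, Y))    # noisy with only 20 years of data
```

For the demand equation, the roles swap: the design matrix would hold the columns for the intercept, price, bread, and olive oil, while `Z` stays the same.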

Our model is a hypothetical, but one of the first applications of IVLS was
to estimate supply and demand equations for butter (Wright 1928, p. 316). 
See Angrist and Krueger (2001) for discussion. 


Exercise set C 


1. An economist is specifying a model for the butter market in Illinois. She 
likes the model that we used for Wisconsin. She is willing to assume that 
the determinants of supply (wage rates and hay prices) are exogenous; 
also that the determinants of demand (prices of bread and olive oil) are 
exogenous. After reading sections 1-2 and looking at equation (10), she
wants to use OLS not IVLS, and is therefore willing to assume that $P_t$
is exogenous. What is your advice?

2. Let $e = Y - X\hat\beta_{\mathrm{IVLS}}$ be the residuals from IVLS. True or false, and
explain:

(b) $e \perp X$.
(c) $\|Y\|^2 = \|X\hat\beta_{\mathrm{IVLS}}\|^2 + \|e\|^2$.
(d) $\hat\sigma^2 = \|e\|^2/(n - p)$.

3. Which is smaller, $\|Y - X\hat\beta_{\mathrm{IVLS}}\|^2$ or $\|Y - X\hat\beta_{\mathrm{OLS}}\|^2$? Discuss briefly.

4. Is $\hat\beta_{\mathrm{IVLS}}$ biased or unbiased? What about $\hat\sigma^2 = \|Y - X\hat\beta_{\mathrm{IVLS}}\|^2/(n - p)$
as an estimator for $\sigma^2$?

5. (Hard.) Verify that $\hat\beta_{\mathrm{IVLS}} = (Z'X)^{-1}Z'Y$ in the just-identified case
($q = p$). In particular, OLS is a special case of IVLS, with $Z = X$.

6. (Hard.) Pretend $Z'X$ is constant. To motivate definition (13), show that
$\mathrm{cov}(\hat\beta_{\mathrm{IVLS}}|Z) = \sigma^2 [X'Z(Z'Z)^{-1}Z'X]^{-1}$.


9.4 What are the two stages? 


In the olden days, the model (4) was estimated in two stages. 


Stage I. Regress $X$ on $Z$. (This first-stage regression can be done one
column at a time.) The fitted values are $\hat X = Z\hat\gamma$, where $\hat\gamma = (Z'Z)^{-1}Z'X$.


Stage II. Regress $Y$ on $\hat X$.


In short,


(16) $\hat\beta_{\mathrm{IISLS}} = (\hat X'\hat X)^{-1} \hat X'Y.$


The idea: $\hat X$ is almost a function of $Z$, and has been “purged” of endogeneity.

By slightly tedious algebra, $\hat\beta_{\mathrm{IISLS}} = \hat\beta_{\mathrm{IVLS}}$. To begin the argument, let
$H_Z = Z(Z'Z)^{-1}Z'$. The IVLS estimator in (10) can be rewritten in terms of
$H_Z$ as


(17) $\hat\beta_{\mathrm{IVLS}} = (X'H_Z X)^{-1} X'H_Z Y.$

Since $H_Z$ is a symmetric idempotent matrix (section 4.2),
$X'H_Z X = (H_Z X)'(H_Z X)$ and $X'H_Z Y = (H_Z X)'Y$.

Substitute into (17):

(18) $\hat\beta_{\mathrm{IVLS}} = [(H_Z X)'(H_Z X)]^{-1}(H_Z X)'Y.$


According to (18), regressing $Y$ on $H_Z X$ gives $\hat\beta_{\mathrm{IVLS}}$. But that is also the
recipe for $\hat\beta_{\mathrm{IISLS}}$: the fitted values in Stage I are $H_Z X = \hat X$, because $H_Z$ is
the hat matrix which projects onto the column space of $Z$. The proof that
$\hat\beta_{\mathrm{IISLS}} = \hat\beta_{\mathrm{IVLS}}$ is complete.

Likewise, $\widehat{\mathrm{cov}}$ in (13)-(14) is $\hat\sigma^2 (\hat X'\hat X)^{-1}$. If you just sit down and run
regressions, however, you may get the wrong SEs. The computer estimates
$\sigma^2$ as $\|Y - \hat X\hat\beta_{\mathrm{IISLS}}\|^2/(n - p)$, but you want $\|Y - X\hat\beta_{\mathrm{IISLS}}\|^2/(n - p)$,
without the hat on the $X$. The fix is easy, once you know the problem:
compute the residuals as $Y - X\hat\beta_{\mathrm{IISLS}}$. The algebra may be a little intricate,
but the message of this section is simple: old-fashioned IISLS coincides with
new-fangled IVLS.
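Both points, the equality of the estimators and the trap with the standard errors, can be seen on simulated data. The sketch below uses hypothetical names (`two_stage`, `sigma2_wrong`, `sigma2_right`) and a made-up data-generating process with error variance 1. In this particular design the naive variance estimate comes out far too large; the size and direction of the discrepancy depend on the design.

```python
import numpy as np

def two_stage(X, Z, Y):
    """Old-fashioned two-stage estimate, with both candidate estimates of sigma^2."""
    n, p = X.shape
    gamma_hat = np.linalg.solve(Z.T @ Z, Z.T @ X)              # Stage I, all columns at once
    X_hat = Z @ gamma_hat                                      # fitted values
    beta_hat = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ Y)   # Stage II, equation (16)
    sigma2_wrong = np.sum((Y - X_hat @ beta_hat) ** 2) / (n - p)   # residuals with the hat on X
    sigma2_right = np.sum((Y - X @ beta_hat) ** 2) / (n - p)       # residuals without the hat
    return beta_hat, sigma2_wrong, sigma2_right

def ivls(X, Z, Y):
    """Equation (10), for comparison."""
    ZtZ_inv = np.linalg.inv(Z.T @ Z)
    A = X.T @ Z @ ZtZ_inv @ Z.T @ X
    return np.linalg.solve(A, X.T @ Z @ ZtZ_inv @ Z.T @ Y)

rng = np.random.default_rng(7)
n = 500
Z = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
delta = rng.normal(size=n)                     # true error variance is 1
x = Z[:, 1] + 0.5 * Z[:, 2] + 0.8 * delta
X = np.column_stack([np.ones(n), x])
Y = X @ np.array([1.0, 2.0]) + delta
beta_2s, sigma2_wrong, sigma2_right = two_stage(X, Z, Y)
beta_ivls = ivls(X, Z, Y)
print(beta_2s, beta_ivls)
print(sigma2_wrong, sigma2_right)
```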
Invariance assumptions


Invariance assumptions need to be made in order to draw causal conclu- 
sions from non-experimental data: parameters are invariant—unchanging— 
under interventions, and so are errors or their distributions (sections 6.4—5). 
Exogeneity is another concern. In a real example, as opposed to a hypothetical
about butter, real questions would have to be asked about these assumptions. 
Why are the equations “structural,” in the sense that the required invariance 
assumptions hold true? Applied papers seldom address such assumptions, or 
the narrower statistical assumptions: for instance, why are errors IID? 

The tension here is worth considering. We want to use regression to draw 
causal inferences from non-experimental data. To do that, we need to know 
that certain parameters and certain distributions would remain invariant if we 
were to intervene. Invariance can seldom be demonstrated experimentally. If 
it could, we probably wouldn’t be discussing invariance assumptions, at least 
in that application. What then is the source of the knowledge? 

“Economic theory” seems like a natural answer, but an incomplete one. 
Theory has to be anchored in reality. Sooner or later, invariance needs em- 
pirical demonstration, which is easier said than done. Outside of economics, 
the situation is perhaps even less satisfactory, because theory is less well de- 
veloped, interventions are harder to define, and the hypothetical experiments 
are murkier. 


9.5 A social-science example: education and fertility 


Simultaneous equations are often used to model reciprocal causation—U 
influences V, and V influences U. Here is an example. Rindfuss et al (1980) 
propose a simultaneous-equations model to explain the process by which a 
woman decides how much education to get, and when to have children. The 
authors’ explanation is as follows. 


“The interplay between education and fertility has a significant influ- 
ence on the roles women occupy, when in their life cycle they occupy these 
roles, and the length of time spent in these roles. . . . This paper explores the 
theoretical linkages between education and fertility. ... It is found that the 
reciprocal relationship between education and age at first birth is dominated 
by the effect from education to age at first birth with only a trivial effect in 
the other direction. 

“No factor has a greater impact on the roles women occupy than mater- 
nity. Whether a woman becomes a mother, the age at which she does so, and 
the timing and number of subsequent births set the conditions under which 
other roles are assumed. . . . Education is another prime factor conditioning 
female roles. ... 
“The overall relationship between education and fertility has its roots at
some unspecified point in adolescence, or perhaps even earlier. At this point 
aspirations for educational attainment as a goal in itself and for adult roles 
that have implications for educational attainment first emerge. The desire for 
education as a measure of status and ability in academic work may encourage 
women to select occupational goals that require a high level of educational 
attainment. Conversely, particular occupational or role aspirations may set 
standards of education that must be achieved. The obverse is true for those 
with either low educational or occupational goals. Also, occupational and 
educational aspirations are affected by a number of prior factors, such as 
mother’s education, father’s education, family income, intellectual ability, 
prior educational experience, race, and number of siblings. .. .” 


Rindfuss et al (their paper is reprinted at the back of the book) use a
simultaneous-equations model, with variables defined in table 1 below.


Table 1. Variables in the model (Rindfuss et al 1980).

The endogenous variables
  ED      Respondent’s education
          (Years of schooling completed at first marriage)
  AGE     Respondent’s age at first birth

The exogenous variables
  OCC     Respondent’s father’s occupation
  RACE    Race of respondent (Black = 1, other = 0)
  NOSIB   Respondent’s number of siblings
  FARM    Farm background (coded 1 if respondent grew up
          on a farm, else coded 0)
  REGN    Region where respondent grew up (South = 1, other = 0)
  ADOLF   Broken family (coded 0 if both parents present
          when respondent was 14, else coded 1)
  REL     Religion (Catholic = 1, other = 0)
  YCIG    Smoking (coded 1 if respondent smoked before age 16,
          else coded 0)
  FEC     Fecundability (coded 1 if respondent had
          a miscarriage before first birth; else coded 0)

Notes: The data are from a probability sample of 1766 women 35-44
years of age residing in the continental United States. The sample was
restricted to ever-married women with at least one child. OCC was
measured on Duncan’s scale (section 6.1), combining information on
education and income. Notation differs from Rindfuss et al.


There are two endogenous variables, ED and AGE. The exogenous variables are
OCC, ..., FEC. The notes to the table describe the sample survey that
collected the data. The model consists of two linear equations in the two
unknowns, ED and AGE:


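(19a) $\mathrm{ED}_i = a_0 + a_1 \mathrm{AGE}_i + a_2 \mathrm{OCC}_i + a_3 \mathrm{RACE}_i + \cdots + a_{10} \mathrm{YCIG}_i + \delta_i,$
(19b) $\mathrm{AGE}_i = b_0 + b_1 \mathrm{ED}_i + b_2 \mathrm{FEC}_i + b_3 \mathrm{RACE}_i + \cdots + b_{10} \mathrm{YCIG}_i + \epsilon_i.$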


According to the model, a woman (indexed by the subscript $i$) chooses
her educational level $\mathrm{ED}_i$ and age at first birth $\mathrm{AGE}_i$ as if by solving the two
equations for the two unknowns. These equations are response schedules
(sections 6.4-5). The $a_0, a_1, \ldots, b_0, b_1, \ldots$ are parameters, to be estimated
from the data. The terms in $\mathrm{OCC}_i, \mathrm{FEC}_i, \ldots, \mathrm{YCIG}_i$ take background factors
into account. The random errors $(\delta_i, \epsilon_i)$ are assumed to have mean 0, and (as
pairs) to be independent and identically distributed from woman to woman.
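Carrying out the “solving” step explicitly may help show why AGE cannot be treated as exogenous in the education equation. The derivation below is a sketch, not part of the original text. It introduces two abbreviations, $A_i$ and $B_i$, for the background terms in (19a) and (19b), and it assumes $a_1 b_1 \neq 1$ so that the pair of equations has a unique solution.

```latex
% Abbreviations introduced here for illustration only:
%   A_i = a_0 + a_2 OCC_i + a_3 RACE_i + ... + a_{10} YCIG_i
%   B_i = b_0 + b_2 FEC_i + b_3 RACE_i + ... + b_{10} YCIG_i
\begin{align*}
\mathrm{ED}_i  &= A_i + a_1 \mathrm{AGE}_i + \delta_i,
&
\mathrm{AGE}_i &= B_i + b_1 \mathrm{ED}_i + \epsilon_i .
\end{align*}
% Substitute the second equation into the first, and vice versa:
\begin{align*}
\mathrm{ED}_i  &= \frac{A_i + a_1 B_i + \delta_i + a_1 \epsilon_i}{1 - a_1 b_1},
&
\mathrm{AGE}_i &= \frac{B_i + b_1 A_i + \epsilon_i + b_1 \delta_i}{1 - a_1 b_1}.
\end{align*}
```

The solution for $\mathrm{AGE}_i$ contains the term $b_1 \delta_i + \epsilon_i$, so $\mathrm{AGE}_i$ is generally correlated with the error $\delta_i$ in the education equation. Likewise, $\mathrm{ED}_i$ contains $\delta_i + a_1 \epsilon_i$ and is generally correlated with $\epsilon_i$.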

The model allows $\delta_i$ and $\epsilon_i$ to be correlated; $\delta_i$ may have a different
distribution from $\epsilon_i$. Rindfuss et al use two-stage least squares to fit the
equations. Notice that they have excluded FEC from equation (19a), and 
OCC from equation (19b). Without these identifying restrictions, the system 
would be under-identified (section 2 above). 

The main empirical finding is this. The estimated coefficient of AGE in
(19a) is not statistically significant, i.e., $a_1$ could be zero. The woman who
dropped out of school because she got pregnant at age 16 would have dropped
out anyway. By contrast, $\hat b_1$ is significant. The causal arrow points from ED
to AGE, not the other way. This finding depends on the model. When looked 
at coldly, the argument may seem implausible. A critique can be given along 
the following lines. 


(i) Assumptions about the errors. Why are the errors independent and 
identically distributed across the women? Independence may be 
reasonable, but heterogeneity is more plausible than homogeneity. 


(ii) Omitted variables. Important variables have been omitted from 
the model, including two that were identified by Rindfuss et al 
themselves—aspirations and intellectual ability. (See the quotes at 
the beginning of the section.) Since Malthus (1798), it has been 
considered that wealth is an important factor in determining educa- 
tion and marriage. Wealth is not in the model. Social class matters, 
and OCC measures only one of its aspects. 


(iii) Why additive linear effects? 

(iv) Constant coefficients. Rindfuss et al are assuming that the same 
parameters apply to all women alike, from poor blacks in the cities 
of the Northeast to rich whites in the suburbs of the West. Why? 
(v) Are FEC, OCC, and so forth really exogenous?
(vi) What about the identifying restrictions? 
(vii) Are the equations structural? 


It is easier to think about questions (v)-(vii) in the context of a model
that restricts attention to a more homogeneous group of women, where the 
only relevant background factors are OCC and FEC. The response schedules 
behind the model are as follows. 


(20a) $\mathrm{ED} = c + a_1 \mathrm{AGE} + a_2 \mathrm{OCC} + \delta,$
(20b) $\mathrm{AGE} = d + b_1 \mathrm{ED} + b_2 \mathrm{FEC} + \epsilon.$


What do these assumptions really mean? Two hypothetical experiments 
help answer this question. In both experiments, fathers are assigned to jobs; 
and daughters are assigned to have a miscarriage before giving birth to their 
first child (FEC = 1), or not to have a miscarriage (FEC = 0). 


Experiment #1. Daughters are assigned to the various levels of AGE. 
ED is observed as the response. In other words, the hypothetical exper- 
imenter chooses when the woman has her first child, but allows her to 
decide when to leave school. 


Experiment #2. Daughters are assigned to the various levels of ED. 
Then AGE is observed as the response. The hypothetical experimenter 
decides when the woman has had enough education, but lets her have a 
baby when she wants to. 


The statistical terminology is rather dry. The experimenter makes fathers 
do one job rather than another: surgeons cut pastrami sandwiches and taxi 
drivers run the central banks. Women are made to miscarry at one time and 
have their first child at another. 

The equations can now be translated. According to (20a), in the first 
experiment, ED does not depend on FEC. (That is one of the identifying 
restrictions assumed by Rindfuss et al.) Moreover, ED depends linearly 
on AGE and OCC, plus an additive random error. According to (20b), in 
the second experiment, AGE does not depend on OCC. (That is the other 
identifying restriction assumed by Rindfuss et al.) Moreover, AGE depends 
linearly on ED and FEC, plus an additive random error. Even for thought 
experiments, this is a little fanciful. 

We return now to the full model, equations (19a)-(19b). The data were 
collected in a sample survey, not an experiment (notes to table 1). Rindfuss 
et al must be assuming that Nature assigned OCC, FEC, RACE, ... independently
of the disturbance terms $\delta$ and $\epsilon$ in (19a) and (19b). That assumption is
what makes OCC, FEC, RACE, ... exogenous. Rindfuss et al must further be
assuming that women chose ED and AGE as if by solving the two equations 
(19a) and (19b) for the two unknowns, ED and AGE. Without this assumption, 
simultaneous-equation modeling seems irrelevant. (The comparable element 
in the butter model is the law of supply and demand.) 

The equations estimated from the survey data should also apply to exper- 
imental situations where ED and AGE are manipulated. For instance, women 
who freely choose their educational levels and their times to have children 
should do so using the same pair of equations—with the same parameter val- 
ues and error terms—as women made to give birth at certain ages. These con- 
stancy assumptions are the basis for causal inference from non-experimental 
data. The data analysis in the paper doesn’t justify such assumptions. How 
could it? 

Without the response schedules that embody the constancy assumptions, 
it is hard to see what “effects” might mean, apart from slopes of a plane that 
has been fitted to survey data. It would remain unclear why planes should be 
fitted by two-stage least squares, or what role the significance tests are playing. 
Rindfuss et al have an interesting question, and there is much wisdom in 
their paper. But they have not demonstrated a connection between the social 
problem they are studying and the statistical technique they are using. 

Simultaneous equations that derive from response schedules are struc- 
tural. Structural equations hold for the observational studies in which the 
data were collected—and for the hypothetical experiments that usually re- 
main behind the scenes. Unless equations are structural, they have no causal 
implications (section 6.5). 


More on Rindfuss et al 


Rindfuss et al make arguments to support their position, but their at- 
tempts to justify the identifying restrictions look artificial. Exogeneity as- 
sumptions are mentioned in Rindfuss and St. John (1983); however, a critical 
step is missing. Variables labeled as “instrumental” or “exogenous,” like 
OCC, FEC, RACE, ..., need to be independent of the error terms. Why 
would that be so? 

Hofferth and Moore (1979, 1980) obtain different results using different 
instruments, as noted by Hofferth (1984). Rindfuss et al (1984) say that 


“instrumental variables . . . . require strong theoretical assumptions . . . .
and can give quite different results when alternative assumptions are
made. . . . it is usually difficult to argue that behavioral variables are truly
exogenous and that they affect only one of the endogenous variables but
not the other.” [pp. 981-82]

Thus, results depend quite strongly on assumptions about identifying
restrictions and exogeneity, and there is no good way to justify one set of 
assumptions rather than another. Bartels (1991) comments on the impact of 
exogeneity assumptions and the difficulty of verification. Also see Altonji 
et al (2005). Rindfuss and St. John (1983) give useful detail on the model. 
There is an interesting exchange between Geronimus and Korenman (1993) 
and Hoffman et al (1993) on the costs of teenage pregnancy. 


9.6 Covariates 


In the butter hypothetical, we could take the exogenous variables as
non-manipulable covariates. The assumption would be that Nature chooses
$(W_t, H_t, F_t, O_t) : t = 1, \ldots, 20$ independently of the random error terms
$(\delta_t, \epsilon_t) : t = 1, \ldots, 20$.

The error terms would still be assumed IID (as pairs) with mean 0, and
a $2 \times 2$ covariance matrix. We still have two hypothetical experiments: (i) set
the price $P$ to farmers, and see how much butter comes to market; (ii) set the
price $P$ to consumers and see how much butter is bought. By assumption,
the answer to (i) is


(21a) $Q = a_0 + a_1 P + a_2 W_t + a_3 H_t + \delta_t,$

while the answer to (ii) is


(21b) $Q = b_0 + b_1 P + b_2 F_t + b_3 O_t + \epsilon_t.$


For the observational data, we would still need to assume that $Q_t$ and $P_t$ in
year $t$ are determined as if by solving (21a) and (21b) for the two unknowns,
$Q$ and $P$, which gets us back to (2a) and (2b).

With Rindfuss et al, OCC, FEC, RACE, ... could be taken as non- 
manipulable covariates, eliminating some of the difficulty in the hypothetical 
experiments. The identifying restrictions—FEC is excluded from (19a) and 
OCC from (19b)—remain mysterious, as does the assumed linearity. How 
could you verify such assumptions? 

Often, “covariate” just means a right hand side variable in a regression 
equation—especially if that variable is only included to control for a possible 
confounder. Sometimes, “covariate” signifies a non-manipulable characteris- 
tic, like age or sex. Non-manipulable variables are occasionally called “con- 
comitants.” To make causal inferences from observational data, we would 
have to assume that statistical relations are invariant to interventions: the 
equations, the coefficients, the random error terms, and the covariates all stay 
the same when we start manipulating the variables we can manipulate. 
9.7 Linear probability models


Schneider et al (1997) use two-stage least squares—with lots of bells and 
whistles—to study the effects of school choice on social capital. (The paper 
is reprinted at the back of the book; also see Schneider et al 2002.) “Linear 
probability models” are used to control for confounders and self-selection. 
The estimation strategy is quite intricate. Let’s set the details aside, and think 
about the logic. First, here is what Schneider et al say they are doing, and 
what they found: 


“While the possible decline in the level of social capital in the United 
States has received considerable attention by scholars such as Putnam and 
Fukuyama, less attention has been paid to the local activities of citizens that 
help define a nation’s stock of social capital . . . . giving parents greater choice 
over the public schools their children attend creates incentives for parents as 
‘citizen/consumers’ to engage in activities that build social capital. Our empirical
analysis employs a quasi-experimental approach . . . . the design of
governmental institutions can create incentives for individuals to engage in 
activities that increase social capital . . . . active participation in school choice 
increases levels of involvement with voluntary organizations. ... School 
choice can help build social capital.” 


Social capital is a very complicated concept, and quantification is more of 
a challenge than Schneider et al are willing to recognize. PTA membership— 
one measure of social capital, according to Schneider et al—is closer to ground 
level. (PTA means Parent-Teachers Association.) Schneider et al suggest that 
school choice promotes PTA membership. They want to prove this by running 
regressions on observational data. We’ll look at results in their tables 1-2. 

The analysis involves about 600 families with children in school in New 
York school districts 1 and 4. Schneider et al find that “active choosers” are 
more likely to be PTA members, other things being equal. Is this causa- 
tion, or self-selection? The sort of parents who exercise choice might be the 
sort of parents who go to PTA meetings. The investigators use a two-stage 
model to correct for self-selection—like Evans and Schwab, but with a linear 
specification instead of probits. 

There are statistical controls for universal choice, dissatisfaction, school 
size, black, Hispanic, Asian, length of residence, education, employed, fe- 
male, church attendance (table 2 in the paper). School size, length of res- 
idence, and education are continuous variables. So is church attendance: 
frequency of attendance is scaled from 1 to 7. The other variables are all 
dummies. “Universal choice” is 1 for families in district 4, and 0 in district 1. 
“Dissatisfaction” is 1 if the parents often think about moving the child to 
another school, and 0 otherwise: note 12 in the paper. The statistical controls
for family $i$ are denoted $W_i$. (The paper uses different notation.)

The dummy variable $Y_i$ is 1 if family $i$ exercises school choice. There is
another dummy $Z_i$ for PTA membership. The object is to show that in some
sense, $Y_i$ influences $Z_i$. There are two instrumental variables, both dummy
variables: when choosing a school, did the parents think its values mattered?
did they think the diversity of the student body mattered? The instrumental
variables for family $i$ are denoted $X_i$.


The assumptions 


Each family (indexed by $i$) has a pair of latent variables $(U_i, V_i)$, with
$E(U_i) = E(V_i) = 0$. The $(U_i, V_i)$ are taken as IID across families $i$, but $U_i$
and $V_i$ may be correlated. The $(U_i, V_i)$ are supposed to be independent of
the $(X_i, W_i)$. Equations (22)-(23) represent the social physics:


(22) $P\{Y_i = 1 \mid X, W, U, V\} = X_i a + W_i b + U_i,$
(23) $P\{Z_i = 1 \mid Y, X, W, U, V\} = cY_i + W_i d + V_i.$


Here, $X$ is the $n \times 2$ matrix whose $i$th row is $X_i$, and so forth. Given
$X, W, U, V$, the response variables $(Y_i, Z_i)$ are independent in $i$.

Equation (22) is an “assignment equation.” The assignment equation
says how likely it is for family $i$ to exercise school choice. Equation (23)
explains $Z_i$ in terms of $Y_i$, $X_i$, $W_i$ and the latent variables $U_i$, $V_i$. (Remember,
$Y_i = 1$ if family $i$ exercises school choice, and $Z_i = 1$ if the parents are PTA
members.) The crucial parameter in (23) is $c$, the “effect” of active choice
on PTA membership. This $c$ is scalar; $a, b, d$ are vectors because $X_i$, $W_i$
are vectors. Equations (22) and (23) are called “linear probability models:”
probabilities are expressed as linear combinations of control variables, plus
latent variables that are meant to capture unmeasured personal characteristics.
In the bivariate probit model for Catholic schools, the assignment equation is
(7.9) and the analog of (23) is (7.4).

Equations (1) and (2) in the paper look different from (22) and (23). 
They are different. In (1), well, Schneider et al aren’t distinguishing between 
Y = 1 and P(Y = 1). Equation (2) in the paper has the same defect. 
Furthermore, the equation is part of a fitting algorithm rather than a model. 
The algorithm involves two-stage least squares. That is why “predicted active 
chooser” appears on the right hand side of the equation. (“Active choosers” 
are parents who exercise school choice: these parents choose a school for 
their children other than the default local public school.) 

Figure 3 is the graphical counterpart of equations (22)-(23). The arrows
leading into $Y$ represent the variables on the right hand side of (22); the arrows
leading into $Z$ represent the variables on the right hand side of (23). The dotted
line connecting $U$ and $V$ represents the (unknown) correlation between the
disturbance terms in the two equations. There is no arrow from $X$ to $Z$: by
assumption, $X$ is excluded from (23). There are no dotted lines connecting
the disturbance terms to $X$ and $W$: by assumption, the latter are exogenous.

Figure 3. PTA membership explained.
[Path diagram. Nodes: U, V (correlated errors); X (instruments); Y (active chooser); Z (PTA); W (statistical controls).]

The vision behind (22) and (23) is this. Nature chooses $(U_i, V_i)$ as IID
pairs from a certain probability distribution, which is unknown to us. Next
(here comes the exogeneity assumption), Nature chooses the $X_i$’s and $W_i$’s,
independently of the $U_i$’s and $V_i$’s. Having chosen all these variables, Nature
then flips a coin to see if $Y_i = 0$ or 1. According to (22), the probability
that $Y_i = 1$ is $X_i a + W_i b + U_i$. Nature is supposed to take the $Y_i$ she just
generated, and plug it into (23). Then she flips a coin to see if $Z_i = 0$ or 1.
According to (23), the probability that $Z_i = 1$ is $cY_i + W_i d + V_i$.
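Nature’s recipe can be written out as a simulation. The sketch below is purely illustrative: the parameter values, the single control dummy in `W`, and the bounded latent variables (built from uniforms so that every probability stays between 0 and 1) are hypothetical, and the plain IVLS formula (10) stands in for the more elaborate procedure in the paper. Variable names follow this section, so `Z` is PTA membership and `X` holds the instruments.

```python
import numpy as np

def simulate_lpm(n, a, b, c, d, rng):
    """Generate data according to the two linear probability equations (22)-(23)."""
    X = rng.integers(0, 2, size=(n, 2)).astype(float)             # two dummy instruments
    W = np.column_stack([np.ones(n), rng.integers(0, 2, size=n)]).astype(float)  # intercept + one control
    G, G1, G2 = rng.uniform(-0.5, 0.5, size=(3, n))
    U = 0.1 * (G + G1)                  # latent variables: mean 0, bounded,
    V = 0.1 * (G + G2)                  # correlated through the shared G
    p_Y = X @ a + W @ b + U             # right hand side of (22)
    Y = (rng.random(n) < p_Y).astype(float)    # first coin flip
    p_Z = c * Y + W @ d + V             # right hand side of (23)
    Z = (rng.random(n) < p_Z).astype(float)    # second coin flip
    return X, W, Y, Z, p_Y, p_Z

def ivls(design, instruments, response):
    """Equation (10), with explicit argument names to avoid clashing with this section's Z."""
    inv = np.linalg.inv(instruments.T @ instruments)
    A = design.T @ instruments @ inv @ instruments.T @ design
    return np.linalg.solve(A, design.T @ instruments @ inv @ instruments.T @ response)

a, b = np.array([0.2, 0.1]), np.array([0.3, 0.1])   # hypothetical parameters in (22)
c, d = 0.15, np.array([0.3, 0.1])                   # hypothetical parameters in (23)
X, W, Y, Z, p_Y, p_Z = simulate_lpm(100_000, a, b, c, d, np.random.default_rng(8))
c_hat = ivls(np.column_stack([Y, W]), np.column_stack([X, W]), Z)[0]
print(p_Y.min(), p_Y.max(), p_Z.min(), p_Z.max())
print(c_hat)
```

The simulation makes one requirement of the model visible: the right hand sides of (22) and (23) have to land between 0 and 1 for every family, which constrains the parameters and the latent variables jointly.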

We do not get to see the parameters $a, b, c, d$ or the latent variables
$U_i, V_i$. All we get to see is $X_i, W_i, Y_i, Z_i$. Schneider et al estimate $c$ by some
complicated version of two-stage least squares: $\hat c = 0.128$ and SE $= 0.064$,
so $t = 0.128/0.064 = 2$ and $P \approx 0.05$. (See table 2 in the paper.) School
choice matters. QED.
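The arithmetic behind the reported significance level is short. The sketch below assumes a two-sided test with the normal approximation to the distribution of $t$; whether Schneider et al used exactly this reference distribution is not stated here.

```python
import math

c_hat, se = 0.128, 0.064
t = c_hat / se
# Two-sided P-value under the standard normal: 2 * (1 - Phi(|t|)) = erfc(|t| / sqrt(2))
p_two_sided = math.erfc(abs(t) / math.sqrt(2))
print(t, round(p_two_sided, 4))
```

Under that approximation the P-value is a little below 0.05, which is consistent with the rounded figure quoted from the paper.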


The questions 


This paper leaves too many loose ends to be convincing. Why are the 
variables used as instruments independent of the latent variables? For that 
matter, what makes the control variables independent of the latent variables? 
Why are the latent variables IID across subjects? Where does linearity come
from? Why are the parameters $a, b, c, d$ the same for all subjects? What
justifies the identifying restriction, namely no $X$ on the right hand side of (23)?

The questions keep coming. Table B1 indicates that the dummies for 
dissatisfaction and district 4 were excluded from the assignment equation; so 
was school size. Why? There are 580 subjects in the PTA model (table 1 in 
Schneider et al). What about the other $400 + 401 - 580 = 221$ respondents
(table A1)? Or the $113 + 522 + 225 + 1642 = 2502$ non-respondents? At a
more basic level, what intervention are Schneider et al talking about? After 
all, you can’t force someone to be an “active chooser.” And what suggests 
stability under interventions? As with previous examples (Evans and Schwab, 
Rindfuss et al) there is a disconnect between the research questions and the 
data processing. 


Exercise set D 


Schneider et al is reprinted at the back of the book. The estimated coefficient 
for school size reported in table 2 is $-0.000$; i.e., the estimate was somewhere
between 0 and $-0.0005$. When doing exercises 1 and 2, you may assume the
estimate is $-0.0003$.


1. Using the data in table 2 of Schneider et al, estimate the probability 
that a respondent with the following characteristics will be a PTA mem- 
ber: (i) active chooser, (ii) lives in district 1, (iii) dissatisfied, (iv) child 
attends a school which has 300 students, (v) black, (vi) lived in dis- 
trict 1 for 11 years before survey, (vii) completed 12 years of schooling, 
(viii) employed, (ix) female, (x) atheist—never goes to church—never!! 


2. Repeat, for a respondent who is not an active chooser but has otherwise 
the same characteristics as the respondent in exercise 1. 


3. What is the difference between the numbers for the two respondents in 
exercises 1 and 2? How do Schneider et al interpret the difference? 


4. Given the model, the numbers you have computed for the two respon-
dents in exercises 1 and 2 are best interpreted as ______. Options:


probabilities / estimated probabilities / estimated expected probabilities


5. What is it in the data that makes the coefficient of school size so close
to 0? (For instance, would −0.3 be feasible?)


6. Do equations (1) and (2) in the paper state the model?


7. (a) Does table 1 in Schneider et al show the sample is representative or 
unrepresentative? 
(b) What percentage of the sample had incomes below $20,000?

(c) Why isn’t there an income variable in table 2? table B1?

(d) To what extent have Schneider et al stated the model? the statistical 
assumptions? 

(e) Are Schneider et al trying to estimate the effect of an intervention? 
If so, what is that intervention? 


9.8 More on IVLS 


This section looks at some fine points in the theory of IVLS. Exercise
set E is hard, but depends only on the material in sections 2-4. After the exer-
cises, there are some computer simulations to illustrate the twists and turns;
IVLS is described in the multivariate normal case. There are suggestions for
further reading.


Some technical issues 


(i) Initially, more instruments may be better; but if $q$ is too close to $n$,
then $\hat{X} \doteq X$ and IISLS may not do much purging.


(ii) The OLS estimator has smaller variance than IVLS, sometimes to
the extent that OLS winds up with smaller mean squared error than IVLS:


$(\text{simultaneity bias})^2 + \text{OLS variance} < (\text{small-sample bias})^2 + \text{IVLS variance}.$


There is a mathematical inequality for the asymptotic variance-covariance
matrices:


$\mathrm{cov}(\hat\beta_{OLS} \mid X) \le \mathrm{cov}(\hat\beta_{IVLS} \mid Z)$


where $A \le B$ means that $B - A$ is non-negative definite. As noted in exercise
C3, OLS has the smaller $\hat\sigma^2$. Next, $Z(Z'Z)^{-1}Z'$ is the projection matrix onto
the columns of $Z$, so


$Z(Z'Z)^{-1}Z' \le I_{n \times n},$

$X'Z(Z'Z)^{-1}Z'X \le X'I_{n \times n}X = X'X,$

$[X'Z(Z'Z)^{-1}Z'X]^{-1} \ge (X'X)^{-1}.$

Equation (13) completes the argument.


(iii) If the instruments are only weakly related to the endogenous vari-
ables, the randomness in $Z'X$ can be similar in size to the randomness in $X$.
Then small-sample bias can be quite large, even when the sample is large
(Bound et al 1995).


(iv) If $Z'Z$ is nearly singular, that can also make trouble.


(v) Even after conditioning on $Z$, the means and variances of matrices
like
$[X'Z(Z'Z)^{-1}Z'X]^{-1}$
can be infinite, due to the inverses. That is one reason for talking about
“asymptotic” means and variances.


(vi) Theoretical treatments of IVLS usually assume that $n$ is large, $p$ and
$q$ are relatively small, $Z'Z \doteq nA$ and $Z'X \doteq nB$, where $A$ is $q \times q$ positive
definite and $B$ is $q \times p$ of rank $p$. Difficulties listed above are precluded. The
IVLS estimator given by (10) is asymptotically normal; the asymptotic mean
is $\beta$ and the asymptotic covariance is given by (13)-(14). Here is a formal
result, where $N(0_{p \times 1}, I_{p \times p})$ denotes the joint distribution of $p$ independent
$N(0, 1)$ variables.


THEOREM 1. Let $Z_i$ be the $i$th row of $Z$, and let $X_i$ be the $i$th row of
$X$. Suppose that the triplets $(Z_i, X_i, \delta_i)$ are IID; that each random variable
has four moments; that $Z_i \perp\!\!\!\perp \delta_i$; that $E(\delta_i) = 0$; that $Y_i = X_i\beta + \delta_i$; that
$E(Z_i'Z_i)$ is non-singular and $E(Z_i'X_i)$ has rank $p$. Then


$\hat\sigma^{-1}\,[X'Z(Z'Z)^{-1}Z'X]^{1/2}\,(\hat\beta_{IVLS} - \beta)$

is asymptotically $N(0_{p \times 1}, I_{p \times p})$ as $n$ gets large.
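
The matrix inequality in item (ii) above can be checked numerically. The sketch below is only an illustration: the sizes `n`, `p`, `q`, the seed, and the way `X` is generated from `Z` are all assumptions made for the example, and NumPy is assumed to be available.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 200, 2, 4                     # hypothetical sizes, with q >= p

Z = rng.normal(size=(n, q))             # instruments
X = Z @ rng.normal(size=(q, p)) + rng.normal(size=(n, p))

# Projection matrix onto the columns of Z: Z (Z'Z)^{-1} Z'.
P_Z = Z @ np.linalg.solve(Z.T @ Z, Z.T)

ols_part = np.linalg.inv(X.T @ X)             # (X'X)^{-1}
ivls_part = np.linalg.inv(X.T @ P_Z @ X)      # [X'Z(Z'Z)^{-1}Z'X]^{-1}

# The difference should be non-negative definite: all eigenvalues >= 0,
# up to rounding error.
diff = ivls_part - ols_part
diff = (diff + diff.T) / 2                    # symmetrize against rounding
min_eig = np.linalg.eigvalsh(diff).min()
print("smallest eigenvalue of the difference:", min_eig)
```

Multiplying both matrices by $\sigma^2$ does not change the ordering, which is why only the matrix parts are compared here.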


Example 1. The scalar case. Let $(Z_i, X_i, \delta_i)$ be IID triplets of scalar
random variables for $i = 1, \ldots, n$. Each random variable has four moments,
and $E(Z_iX_i) > 0$. Assume $E(\delta_i) = 0$ and $Z_i \perp\!\!\!\perp \delta_i$. Let $Y_i = \beta X_i + \delta_i$. We
wish to estimate $\beta$. In this model, $X_i$ may be endogenous. On the other
hand, we can instrument $X_i$ by $Z_i$, because $Z_i \perp\!\!\!\perp \delta_i$. Theorem 1 can be proved
directly. First, $\hat\beta_{IVLS} = \sum_i Z_iY_i / \sum_i Z_iX_i$ by exercise C5. Now substitute
$\beta X_i + \delta_i$ for $Y_i$ to see that $\hat\beta_{IVLS} - \beta = \sum_i Z_i\delta_i / \sum_i Z_iX_i$. The $Z_i\delta_i$ are IID
and $E(Z_i\delta_i) = 0$, so $\sum_i Z_i\delta_i/\sqrt{n}$ is asymptotically normal by the central
limit theorem. Furthermore, $Z_iX_i$ are IID and $E(Z_iX_i) > 0$, so $\sum_i Z_iX_i/n$
converges to a finite positive limit by the law of large numbers. For details
and an estimate of small-sample bias, see


http://www.stat.berkeley.edu/users/census/ivls.pdf 
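
The scalar case in Example 1 can be tried out on the computer. The following is a minimal sketch, not the calculation in the note cited above: the data-generating recipe (how $X_i$ depends on $Z_i$ and $\delta_i$), the parameter values, and the function name `scalar_case` are assumptions chosen for illustration.

```python
import random

def scalar_case(n: int, beta: float = 1.0, rho: float = 0.5, seed: int = 0):
    """Simulate Y_i = beta*X_i + delta_i with X_i endogenous and Z_i an instrument.

    Returns (OLS estimate, IVLS estimate, gap), where gap checks the identity
    beta_hat_IVLS - beta = sum(Z*delta) / sum(Z*X).
    """
    rng = random.Random(seed)
    s_zy = s_zx = s_xy = s_xx = s_zd = 0.0
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        delta = rng.gauss(0.0, 1.0)
        noise = rng.gauss(0.0, 1.0)
        x = z + rho * delta + noise      # correlated with delta, so endogenous
        y = beta * x + delta
        s_zy += z * y
        s_zx += z * x
        s_xy += x * y
        s_xx += x * x
        s_zd += z * delta
    b_ols = s_xy / s_xx                  # regression through the origin
    b_ivls = s_zy / s_zx                 # sum(Z*Y) / sum(Z*X)
    gap = (b_ivls - beta) - s_zd / s_zx
    return b_ols, b_ivls, gap

b_ols, b_ivls, gap = scalar_case(20000)
print(f"OLS: {b_ols:.3f}   IVLS: {b_ivls:.3f}   identity gap: {gap:.1e}")
```

In this setup $E(Z_iX_i) = 1 > 0$, so the conditions of the example hold; with a large sample, IVLS should be near the true value while OLS stays off target.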


Exercise set E 


1. A chance for bonus points. Three investigators are studying the following
model: $Y_i = X_i\beta + \epsilon_i$ for $i = 1, \ldots, n$. The random variables are all
scalar, as is the unknown parameter $\beta$. The unobservable $\epsilon_i$ are IID
with mean 0 and finite variance, but $X$ is endogenous. Fortunately,
the investigators also have an $n \times 1$ vector $Z$, which is exogenous and
not orthogonal to $X$. Investigator #1 wishes to fit the model by OLS.
Investigator #2 wants to regress $Y$ on $X$ and $Z$; the coefficient of $X$ in
this multiple regression would be the estimator for $\beta$. Investigator #3
suggests $\hat\beta = Z'Y/Z'X$. Which of the three estimators would you
recommend? Why? What are the asymptotics? To focus the discussion,
assume that $(X_i, Y_i, Z_i, \epsilon_i)$ are IID four-tuples, jointly normal, mean 0,
and $\mathrm{var}(X_i) = \mathrm{var}(Z_i) = 1$. Assume too that $n$ is large. As a matter of
notation, $Y_i$ is the $i$th component of the $n \times 1$ vector $Y$; similarly for $X$.


2. Another chance for bonus points. Suppose that $(X_i, Y_i, Z_i, \epsilon_i)$ are in-
dependent four-tuples of scalar random variables for $i = 1, \ldots, n$, with
a common jointly normal distribution. All means are 0 and $n$ is large.
Suppose further that $Y_i = X_i\beta + \epsilon_i$. The variables $X_i, Y_i, Z_i$ are ob-
servable, and every pair of them has a positive correlation which is less
than 1. However, $\epsilon_i$ is not observable, and $\beta$ is an unknown constant.
Is the correlation between $Z_i$ and $\epsilon_i$ identifiable? Can $Z$ be used as an
instrument for estimating $\beta$? Explain briefly.


3. Last chance for bonus points. In the over-identified case, we could
estimate $\sigma^2$ by fitting (6) to the data, and dividing the sum of the squared
residuals by $q - p$. What’s wrong with this idea?


Simulations to illustrate IVLS 


Let $(Z_i, \delta_i, \epsilon_i)$ be IID jointly normal with mean 0. Here, $\delta_i$ and $\epsilon_i$ are
scalars, but $Z_i$ is $1 \times q$, with $q \ge 1$. Assume $Z_i \perp\!\!\!\perp (\delta_i, \epsilon_i)$, the components
of $Z_i$ are independent with variance 1, but $\mathrm{cov}(\delta_i, \epsilon_i)$ may not vanish. Let $C$
be a fixed $q \times 1$ matrix, with $\|C\| > 0$. Let $X_i = Z_iC + \delta_i$, a scalar random
variable: in the notation of section 2, $p = 1$. The model is


$Y_i = X_i\beta + \epsilon_i$ for $i = 1, \ldots, n$.


We stack in the usual way: $Y_i$ is the $i$th component of the vector $Y$
and $\epsilon_i$ is the $i$th component of the vector $\epsilon$, while $X_i$ is the $i$th row of the
matrix $X$ and $Z_i$ is the $i$th row of the matrix $Z$. Thus, $Z$ is exogenous ($Z \perp\!\!\!\perp \epsilon$)
and $X$ is endogenous unless $\mathrm{cov}(\delta_i, \epsilon_i) = 0$. We can estimate the scalar
parameter $\beta$ by OLS or IVLS and compare the MSEs. Generally, OLS will
be inconsistent, due to simultaneity bias; IVLS will be consistent. If $n$ is
small or $\|C\|$ is small, then small-sample bias will be an issue. We can also
compare methods for estimating $\mathrm{var}(\epsilon_i)$.

Ideally, IISLS would replace $X_i$ by $Z_iC$. However, $C$ isn’t known. So
the estimator replaces $X_i$ by $Z_i\hat{C}$, with $\hat{C}$ obtained by regressing $X$ on $Z$.
Since $X$ is endogenous, $\hat{C}$ is too: this is the source of small-sample bias.
When $n$ is large, $\hat{C} \doteq C$, and the problem goes away. If $p > 1$, then $X_i$ and
$\delta_i$ should be $1 \times p$, $\beta$ should be $p \times 1$, $C$ should be $q \times p$. We would require
$q \ge p$ and $\mathrm{rank}(C) = p$.
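
The setup just described can be simulated along the following lines. This is a sketch under stated assumptions, not the book's own lab code: the values of `C`, `beta`, `rho` (the correlation between $\delta_i$ and $\epsilon_i$), the sample size, and the number of replications are made up for illustration, and NumPy is assumed to be available.

```python
import numpy as np

def one_draw(rng, n, C, beta, rho):
    """Generate one data set and return (OLS estimate, IVLS estimate) of beta."""
    q = C.shape[0]
    Z = rng.normal(size=(n, q))                   # components independent, variance 1
    delta = rng.normal(size=n)
    # epsilon has variance 1 and correlation rho with delta; both independent of Z.
    eps = rho * delta + np.sqrt(1.0 - rho**2) * rng.normal(size=n)
    X = Z @ C + delta                             # scalar regressor (p = 1)
    Y = beta * X + eps
    b_ols = (X @ Y) / (X @ X)
    # First stage: regress X on Z to get C-hat, then form fitted values Z @ C-hat.
    C_hat = np.linalg.lstsq(Z, X, rcond=None)[0]
    X_hat = Z @ C_hat
    b_ivls = (X_hat @ Y) / (X_hat @ X)
    return b_ols, b_ivls

def compare_mse(n=200, reps=500, beta=1.0, rho=0.8, seed=0):
    """Monte Carlo estimates of the MSE of OLS and IVLS."""
    rng = np.random.default_rng(seed)
    C = np.array([0.5, 0.5, 0.5])                 # hypothetical q = 3 instruments
    draws = np.array([one_draw(rng, n, C, beta, rho) for _ in range(reps)])
    return tuple(np.mean((draws - beta) ** 2, axis=0))

mse_ols, mse_ivls = compare_mse()
print(f"MSE of OLS: {mse_ols:.4f}   MSE of IVLS: {mse_ivls:.4f}")
```

With these particular choices the instruments are fairly strong, so IVLS should come out ahead; shrinking `C` or `n` is one way to see small-sample bias become an issue.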

Terminology. As the sample size gets large, a consistent estimator con- 
verges to the truth; an inconsistent estimator does not. This differs from 
ordinary English usage. 


9.9 Discussion questions 
These questions cover material from previous chapters. 


1. An advertisement for a cancer treatment center starts with the headline 
“Celebrating Life with Cancer Survivors.” The text continues, 


“Did you know that there are more cancer survivors now than ever 
before? .... This means that life after a cancer diagnosis can be a 
reality. . .. we’re proud to be part of an improving trend in cancer 
survival. By offering the tools for early detection, as well as the 
most advanced cancer treatments available, we’re confident that the 
trend will continue.” 


Discuss briefly. What is the connection between earlier diagnosis and 
increasing survival time after diagnosis? 


2. CT (computerized tomography) scans can detect lung cancer very early, 
while the disease is still localized and treatable by a surgeon—although 
the efficacy of treatment is unclear. Henschke et al (2006) found 484 
lung cancers in a large-scale screening program, and estimated the 5- 
year survival rate among these patients as 85%. Most of the cancers 
were resected, that is, surgically removed. By contrast, among patients 
whose lung cancer is diagnosed when the disease becomes symptomatic 
(e.g., with persistent cough, recurrent lung infections, chest pain), the 
5-year survival rate is only about 15%. Do these data make a case for 
CT screening? Discuss briefly. 


3. Pisano et al (2005) studied the “diagnostic performance of digital versus 
film mammography for breast cancer screening.” About 40,000 women 
participated in the trial; each subject was screened by both methods. 


“[The trial] did not measure mortality endpoints. The assumption
inherent in the design of the trial is that screening mammography
reduces the rate of death from breast cancer and that if digital mam-
mography detects cancers at a rate that equals or exceeds that of
film mammography, its use in screening is likely to reduce the risk
of death by as much as or more than . . . film mammography.”


There was little difference in cancer detection rates for all women. How- 
ever, for women with radiographically dense breasts (about half the sub- 
jects and many of the cancers), the detection rate was about 25% higher 
with digital mammography. This difference is highly significant. 


(a) Granting the authors’ design assumption, would you recommend 
digital or film mammography for women with radiographically 
dense breasts? for other women? 


(b) What do you think of the design assumption? 


4. Headlined “False Conviction Study Points to the Unreliability of Evi- 
dence,” the New York Times ran a story about the study, which 


“examined 200 cases in which innocent people served an average 
of 12 years in prison. A few types of unreliable trial evidence pre- 
dictably supported wrongful convictions. The leading cause of the 
wrongful convictions was erroneous identification by eyewitnesses, 
which occurred 79 percent of the time.” 


Discuss briefly. Is eyewitness evidence unreliable? What’s missing from 
the story? 


5. The New York Times ran a story headlined “Study Shows Marathons 
Aren’t Likely To Kill You,” claiming that the risk of dying on a marathon 
is twice as high if you drive it than if you run it. The underlying study 
(Redelmeier and Greenwald 2007) estimated risks for running marathons 
and for driving. The measure of risk was deaths per day. The study 
compared deaths per day from driving on marathon days to deaths per day 
from driving on control days without marathons. The rate on marathon 
days was lower. (Roads are closed during marathons; control days were 
matched to marathon days on day of the week, and the same time periods 
were used; data on traffic fatalities were available only at the county 
level.) The study concluded that 46 lives per day were saved by road 
closures, compared to 26 sudden cardiac deaths among the marathon 
runners, for a net saving of 20 lives. What’s wrong with this picture? 
Comment first on the study, then on the newspaper article. 


6. Prostate cancer is the most common cancer among American men, with 
200,000 new cases diagnosed each year. Patients will usually consult 
a urological surgeon, who recommends one of three treatment plans:
surgical removal of the prostate, radiation that destroys the prostate, or
watchful waiting (do nothing unless the clinical picture worsens). A 
biopsy is used to determine the Gleason score of the cancer, measuring 
its aggressiveness; Gleason scores range from 2 to 10 (higher scores 
correspond to more aggressive cancers). The recommended treatment 
will depend to some extent on biopsy results. The side effects of surgery 
and radiation can be drastic, and efficacy is debatable. So, as the pro- 
fessionals say, “management of this cancer is controversial.” However, 
patients tend to accept the recommendations made by their urologists. 


Grace Lu-Yao and Siu-Long Yao (1997) studied treatment outcomes, us-
ing data from the Surveillance, Epidemiology and End Results (SEER)
Program. This is a cancer registry covering four major metropolitan ar-
eas and five states. The investigators found 59,876 patients who received
a diagnosis of prostate cancer during the period 1983-1992, and were
aged 50-79 at time of diagnosis. For these cases, the authors estimated
10-year survival rates after diagnosis. They chose controls at random 
from the population, matched controls to cases on age, and estimated 
10-year survival rates for the controls. Needless to say, only male con- 
trols were used. Results are shown in the table below for cases with 
moderately aggressive cancer (Gleason scores of 5-7). 


(a) How can the 10-year survival rate for the controls depend on treat-
ment?

(b) Why does the survival rate in the controls decline as you go down
the table?

(c) In the surgery group, the cases live longer than the controls. Should
we recommend surgery as a prophylactic measure? Explain briefly.

(d) The 10-year survival rate in the surgery group is substantially bet-
ter than that in the radiation group or the watchful-waiting group.
Should we conclude that surgery is the preferred treatment option?
Explain briefly.


                     10-year survival (%)
Treatment            Cases    Controls
Surgery              71       64
Radiation            48       52
Watchful waiting     38       49


7. In 2004, as part of a program to monitor its first presidential election, 25
villages were selected at random in a certain area of Indonesia. In to-
tal, there were 25,000 registered voters in the sample villages, of whom
13,000 voted for Megawati: 13,000/25,000 = 0.52. True or false, and
explain: the standard error on the 0.52 is $\sqrt{0.52 \times 0.48/25}$. Or should
it be $\sqrt{0.52 \times 0.48/25{,}000}$? Discuss briefly.


8. (Partly hypothetical.) Psychologists think that older people are happier, 
as are married people; moreover, happiness increases with income. To 
test the theory, a psychologist collects data on a sample of 1500 people, 
and fits a regression model: 


$\text{Happiness}_i = a + bU_i + cV_i + dW_i + \delta_i,$


with the usual assumptions on the error term. Happiness is measured by
self-report, on a scale from 0 to 100. The average is about 50, with an
SD of 15. The dummy $U_i = 1$ if subject $i$ is over 35 years of age, else
$U_i = 0$. Similarly, $V_i = 1$ if subject $i$ is married, else $V_i = 0$. Finally,
$W_i$ is the natural logarithm of subject $i$’s income. (Income is truncated
below at $1.) Suppose for parts (b-d) that the model is right.


(a) What are the usual assumptions? 
(b) Interpret the coefficients b, c, d. What sign should they have? 


(c) Suppose that, in the sample, virtually all subjects over the age of 
35 are married; however, for subjects under the age of 35, about 
half are married and half are unmarried. Does that complicate the 
interpretation? Explain why or why not. 


(d) Suppose that, in the sample, virtually all subjects over the age of 35 
are married; further, virtually all subjects under the age of 35 are 
unmarried. Does that complicate the interpretation? Explain why 
or why not. 


(e) According to the New York Times, “The [psychologists’] theory
was built on the strength of rigorous statistical and mathematical
modeling calculations on computers running complex algorithms.”
What does this mean? Does it argue for or against the theory?
Discuss briefly.


9. Yule ran a regression of changes in pauperism on changes in the out-relief
ratio, with changes in population and changes in the population aged 65+
as control variables. He used data from three censuses and four strata of
unions, the small geographical areas that administered poor-law relief.
He made a causal inference: out-relief increases pauperism. To make
this inference, he had to assume that some things remained constant
amidst changes. Can you explain the constancy assumptions?


10. King, Keohane and Verba (1994) discuss the use of multiple regression
to estimate causal effects in the social sciences. According to them,


“Random error in an explanatory variable produces bias in the esti-
mate of the relationship between the explanatory and the dependent
variable. That bias takes a particular form: it results in the estima-
tion of a weaker causal relationship than is the case.” [p. 158]


Do you agree or disagree? Discuss briefly. (The authors have in mind a
model where $Y = X\beta + \epsilon$, but the investigator observes $X^* = X + \delta$
rather than $X$, and regresses $Y$ on $X^*$.)


11. Ansolabehere and Konisky (2006) want to explain voter turnout $Y_{i,t}$ in
county $i$ and year $t$. Let $X_{i,t}$ be 1 if county $i$ in year $t$ required registration
before voting, else 0; let $Z_{i,t}$ be a $1 \times p$ vector of control variables. The
authors consider two regression models. The first is


(24)  $Y_{i,t} = \alpha + \beta X_{i,t} + Z_{i,t}\gamma + \delta_{i,t}$


where $\delta_{i,t}$ is a random error term. The second is obtained by taking
differences:


(25)  $Y_{i,t} - Y_{i,t-1} = \beta(X_{i,t} - X_{i,t-1}) + (Z_{i,t} - Z_{i,t-1})\gamma + \epsilon_{i,t}$


where $\epsilon_{i,t}$ is a random error term. The chief interest is in $\beta$, whereas $\gamma$ is
a $p \times 1$ vector of nuisance parameters. If (24) satisfies the usual conditions
for an OLS regression model, what about (25)? And vice versa?


12. An investigator fits a regression model $Y = X\beta + \epsilon$ to the data, and
draws causal inferences from $\hat\beta$. A critic suggests that $\beta$ may vary from
one data point to another. According to a third party, the critique, even
if correct, only means there is “unmodeled heterogeneity.”


(a) Why would variation in $\beta$ matter?


(b) Is the third-party response part of the solution, or part of the prob-
lem?


Discuss briefly.


13. A prominent social scientist describes the process of choosing a model
specification as follows.


“We begin with a specification that is suggested by prior theory and
the question that is being addressed. Then we fit the model to the
data. If this produces no useful results, we modify the specification
and try again, with the objective of getting a better fit. In short, the
initial specification is tested before being accepted as the correct
model. Thus, the proof of specification is in the results.”


Discuss briefly.


14. A ______ is assumed going into the data analysis; a ______ is estimated
from the data analysis. Options:


(i) response schedule (ii) regression equation


15. Causation follows from the ______; estimated effects follow from fitting
the ______ to the data. Options:


(i) response schedule (ii) regression equation


16. True or false: the causal effect of X on Y is demonstrated by doing
something to the data with the computer. If true, what is the something?
If false, what else might you need? Explain briefly.


17. What is the exogeneity assumption?


18. Suppose the exogeneity assumption holds. Can you use the data to show
that a response schedule is false? Usually? Sometimes? Hardly ever?
Explain briefly.


19. Suppose the exogeneity assumption holds. Can you use the data to show
that a response schedule is true? Usually? Sometimes? Hardly ever?
Explain briefly.


20. How would you answer questions 18 and 19 if the exogeneity assumption
itself were doubtful?


21. Gilens (2001) proposes a logit model to explain the effect of general
political knowledge on policy preferences. The equation reported in the
paper is

$\mathrm{prob}(Y_i = 1) = \alpha + \beta G_i + X_i\gamma + U_i,$


where $i$ indexes subjects; $Y_i = 1$ if subject $i$ favors a certain policy and
$Y_i = 0$ otherwise; $G_i$ measures subject $i$’s general political knowledge;
$X_i$ is a $1 \times p$ vector of control variables; and $U_i$ is an error term for
subject $i$. In this model, $\alpha$ and $\beta$ are scalar parameters, the latter being
of primary interest; $\gamma$ is a $p \times 1$ parameter vector. Did Gilens manage to
write down a logit model? If not, fix the equation.


22. Marmaros and Sacerdote (2006) look at variables determining volume of
email. Their study population consists of students and recent graduates
of Dartmouth; the study period is one academic year. Let $Y_{ij}$ be
the number of emails exchanged between person $i$ and person $j$, while
$X_i$ is a $1 \times p$ vector describing characteristics of person $i$, and $\beta$ is a
$p \times 1$ parameter vector. Furthermore, $X_{ij}$ is a $1 \times q$ vector describing
characteristics of the pair $(i, j)$, and $\gamma$ is a $q \times 1$ parameter vector. Let


$\exp(x) = e^x, \quad p(y \mid \lambda) = \exp(-\lambda)\lambda^y/y!.$


To estimate parameters, the authors maximize


$\sum_{1 \le i < j \le n} \log p\bigl(Y_{ij} \mid \exp(X_i\beta + X_j\beta + X_{ij}\gamma)\bigr)$


as a function of $\beta$ and $\gamma$, where $n$ is the number of subjects. (Some
details are omitted.) Comment briefly on the data analysis.


23. Suppose $Y_i = a + bZ_i + cW_i + \epsilon_i$, where the $\epsilon_i$ are IID with expectation
0 and variance $\sigma^2$. However, $W_i$ may be endogenous. Assume that
$Z_i = 0$ or 1 has been assigned at random, so the $Z$’s are independent
of the $W$’s and $\epsilon$’s. Let $\hat{b}$ be the coefficient of $Z$ when the equation is
estimated by OLS. True or false and explain: $\hat{b}$ is an unbiased estimate
of $b$.


24. Suppose $Y = X\beta + \epsilon$, where $Y$ and $\epsilon$ are $n \times 1$, $X$ is $n \times p$ of full
rank, the $\epsilon_i$ are IID with $E(\epsilon_i) = 0$ and $E(\epsilon_i^2) = \sigma^2$. Here, $\beta$ and
$\sigma^2$ are parameters to be estimated from the data. However, $X$ may be
endogenous. Let $Z$ be exogenous, $n \times q$, with $q \ge p$. We assume
$Z'Z$ and $Z'X$ have full rank. Let $e_{OLS} = Y - X\hat\beta_{OLS}$. Let $e_{IVLS} =
Y - X\hat\beta_{IVLS}$. Let $f = (Z'Z)^{-1/2}Z'Y - (Z'Z)^{-1/2}Z'X\hat\beta_{IVLS}$. In the
first stage of IISLS, $X$ is regressed, column by column, on $Z$; let $\hat{X}$ be
the fitted values and $A = X - \hat{X}$. As usual, $\perp$ means orthogonality for
data vectors; $\perp\!\!\!\perp$ means statistical independence of random vectors. Say
whether each of the following statements is true or false:

$\epsilon \perp X$    $\epsilon \perp Z$    $\epsilon \perp\!\!\!\perp X$    $\epsilon \perp\!\!\!\perp Z$
$e_{OLS} \perp X$    $e_{OLS} \perp Z$    $e_{OLS} \perp\!\!\!\perp X$    $e_{OLS} \perp\!\!\!\perp Z$
$e_{IVLS} \perp X$    $e_{IVLS} \perp Z$    $e_{IVLS} \perp\!\!\!\perp X$    $e_{IVLS} \perp\!\!\!\perp Z$
$f \perp X$    $f \perp Z$    $f \perp (Z'Z)^{-1/2}Z'X$
$f \perp\!\!\!\perp X$    $f \perp\!\!\!\perp Z$    $f \perp\!\!\!\perp (Z'Z)^{-1/2}Z'X$
$A \perp X$    $A \perp Z$    $A \perp\!\!\!\perp X$    $A \perp\!\!\!\perp Z$

If $e_{OLS} \perp X$, then $X$ is exogenous.

If $f \perp (Z'Z)^{-1/2}Z'X$, that provides support for the
assumed exogeneity of $Z$.

If $f \perp (Z'Z)^{-1/2}Z'X$, then small-sample bias is 0.


25. Suppose $X_i, \delta_i, \epsilon_i$ are independent normal variables, with expectation
0, for $i = 1, 2, \ldots, n$. The variances are 1, $\sigma^2$ and $\tau^2$ respectively.
Suppose

(26)  $Y_i = bX_i + \delta_i,$

(27)  $W_i = cY_i + \epsilon_i.$


These equations reflect true causal relations; $b$ and $c$ are parameters. A
statistician fits


(28)  $Y_i = dW_i + eX_i + u_i$


to the data.
(a) Are the subjects IID?
(b) Shouldn’t there be intercepts in the equations?
(c) Is (28) a good causal model?
(d) Can you choose the parameters so the $R^2$ for (26) is low, while the
$R^2$ for (28) is high?
(e) If the sample is large, find approximate values for $\hat{d}$ and $\hat{e}$.


Explain briefly.


9.10 End notes for chapter 9 


Further reading on econometric technique 

Davidson R, MacKinnon JG (2003). Econometric Theory and Methods. Ox- 
ford University Press. A standard graduate-level textbook. Broad coverage. 
Theoretical. 

Greene WH (2007). Econometric Analysis. 6th ed. Prentice Hall. A standard 
graduate-level textbook. Broad coverage. Theoretical. 

Kennedy P (2003). A Guide to Econometrics. 5th ed. MIT Press. Informal, 
clear, useful. 

Maddala GS (2001). Introduction to Econometrics. 3rd ed. Wiley.
Chatty and clear.

Theil H (1971). Principles of Econometrics. Wiley. This is a formal treat-
ment, but it is clear and accurate. On pp. 444, 451, Theil writes X for the
matrix of instrumental variables, Z for the endogenous variables, which is
opposite to the convention adopted here.


Wooldridge JM (2005). Introductory Econometrics. 3rd ed. Southwestern 
College Publishers. A standard undergraduate textbook. Applied focus. 


The New York Times. The articles mentioned in questions 4, 5, and 8 are 
from 7/23/2007 page A1, 12/21/2008 page A27, and 12/12/2006 page D3, 
respectively. 


Indonesia. Question 7 draws on unpublished reports by Susan Hyde and 
Thad Dunning, Yale University, describing work done for the Carter Center. 


A case study. Many of the issues discussed in this chapter are illustrated
by DiNardo and Pischke’s critique of Krueger:


Krueger AB (1993). How computers have changed the wage structure: Evi-
dence from microdata, 1984-1989. Quarterly Journal of Economics 108:
33-60.

DiNardo JE, Pischke JS (1997). The returns to computer use revisited: Have
pencils changed the wage structure too? Quarterly Journal of Economics
112: 291-303.


10 


Issues in Statistical Modeling 


10.1 Introduction 


It is an article of faith in much applied work that disturbance terms are
IID (Independent and Identically Distributed) across observations. Some-
times, this assumption is replaced by other assumptions that are more com-
plicated but equally artificial. For example, when observations are ordered in
time, the disturbance terms $\epsilon_i$ are sometimes assumed to follow an “autore-
gression,” e.g., $\epsilon_i = \lambda\epsilon_{i-1} + \delta_i$, where now $\lambda$ is a parameter to be estimated,
and it is the $\delta_i$ that are IID. However, there is an alternative that should al-
ways be kept in mind. Disturbances are DDD (Dependent and Differently
Distributed) across subjects. In the autoregression, for example, the $\delta_i$ could
easily be DDD, and introducing yet another model would only postpone the
moment of truth.

A second article of faith for many applied workers is that functions are 
linear with coefficients that are constant across subjects. The alternative is 
that functions are non-linear, with coefficients (or parameters more gener- 
ally) that vary across subjects. The dueling acronyms would be LCC (Linear 
with Constant Coefficients) and NLNC (Non-Linear with Non-constant Co- 
efficients). Some models have “random coefficients,” which only delays the 
inevitable: coefficients are assumed to be drawn at random from distributions 
that are constant across subjects. Why would that be so?

These articles of faith have had considerable influence on the applied
literature. Therefore, when reading a statistical study, try to find out what 
kind of statistical analysis got the authors from the data to the conclusions. 
What are the assumptions behind the analysis? Are these assumptions plau- 
sible? What is allowed to vary and what is taken to be constant? If causal 
inferences are made from observational data, why are parameters invariant 
under interventions? Where are the response schedules? Do the response 
schedules describe reasonable thought experiments? 

For applied workers who are going to publish research based on statistical 
models, the recommendation is to archive the data, the equations, and the 
programs. This would allow replication, at least in the narrowest sense of the 
term (Dewald et al 1986, Hubbard et al 1998). Assumptions should be made 
explicit. It should be made clear which assumptions were checked, and how 
the checking was done. It should also be made clear which assumptions were 
not checked. Stating the model clearly is a good first step—and a step which 
is omitted with remarkable frequency, even in the best journals. 

Modelers may feel there are responses to some of these objections. For 
example, a variety of relevant techniques have not been considered in this 
book, including regression diagnostics, specification tests, and model selec- 
tion procedures. These techniques might be helpful. For instance, diagnos- 
tics are seldom reported in applied papers, and should probably be used more 
often. 

In the end, however, such things work only if there is some relatively lo- 
calized breakdown in the modeling assumptions—a technical problem which 
has a technical fix. There is no way to infer the “right” model from the 
data unless there is strong prior theory to limit the universe of possible mod- 
els. (More technically, diagnostics and specification tests usually have good 
power only against restricted classes of alternatives: Freedman 2008d.) That 
kind of strong theory is rarely available in the social sciences. 

Model selection procedures like AIC (Akaike’s Information Criterion)
only work, under suitable regularity conditions, “in the limit,” as sample
size goes to infinity. Even then, AIC overfits. Therefore, behavior in finite
samples needs to be assessed. Such assessments are unusual. Moreover, AIC 
and the like are commonly used in cases where the regularity conditions do 
not hold, so operating characteristics of the procedures are unknown, even 
with very large samples. Specification tests are open to similar objections. 

Bayesian methods are sometimes thought to solve the model selection 
problem (and other problems too). However, in non-parametric settings, 
even a strictly Bayesian approach can lead to inconsistency, often because 
of overfitting. “Priors” that have infinite mass or depend on the data merely
cloud the issue. For reviews, see Diaconis and Freedman (1998), Eaton and
Freedman (2004), Freedman (1995).


The bootstrap 


How does the bootstrap fit into this picture? The bootstrap is in many 
cases a helpful way to compute standard errors—given the model. The boot- 
strap usually cannot answer basic questions about validity of the model, but 
it can sometimes be used to assess impacts of relatively minor failures in 
assumptions. The bootstrap has been used to create chance models from data 
sets, and some observers will find this pleasing. 
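
As a reminder of the mechanics, here is a minimal sketch of a bootstrap standard error for a sample average. It takes the model as given (IID draws), which is exactly the caveat in the paragraph above; the toy data, the seed, the number of replicates, and the function name `bootstrap_se` are assumptions made for illustration.

```python
import random
import statistics

def bootstrap_se(data, reps=2000, seed=0):
    """Bootstrap SE of the sample average: resample with replacement, recompute."""
    rng = random.Random(seed)
    n = len(data)
    averages = [statistics.fmean(rng.choices(data, k=n)) for _ in range(reps)]
    return statistics.stdev(averages)

# Toy data set, generated here only so the example is self-contained.
gen = random.Random(1)
sample = [gen.gauss(50.0, 15.0) for _ in range(50)]

se_boot = bootstrap_se(sample)
se_formula = statistics.stdev(sample) / len(sample) ** 0.5
print(f"bootstrap SE: {se_boot:.2f}   SD/sqrt(n): {se_formula:.2f}")
```

For the sample average the two numbers should nearly agree; the bootstrap earns its keep with estimators that have no simple formula for the SE.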


The role of asymptotics 


Statistical procedures are often defended on the basis of their “asymp- 
totic” properties—the way they behave when the sample is large. See, for 
instance, Beck (2001, p. 273): “methods can be theoretically justified based 
on their large-[sample] behavior.” This is an oversimplification. If we have a 
sample of size 100, what would happen with a sample of size 100,000 is not a 
decisive consideration. Asymptotics are useful because they give clues to be- 
havior for samples like the one you actually have. Furthermore, asymptotics 
set a threshold. Procedures that do badly with large samples are unlikely to 
do well with small samples. 

With the central limit theorem, the asymptotics take hold rather quickly: 
when the sample size is 25, the normal curve is often a good approximation
to the probability histogram for the sample average; when the sample size is 
100, the approximation is often excellent. With feasible GLS, on the other 
hand, if there are a lot of covariances to estimate, the asymptotics take hold 
rather slowly (chapter 8). 
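
The claim about the central limit theorem can be checked by simulation. The sketch below is only an illustration under assumed conditions: it draws from an exponential distribution (a skewed choice made for this example, not taken from the text) and reports how often the sample average falls within one SE of its expected value. Under the normal curve, that share would be about 0.68.

```python
import random

def fraction_within_one_se(n, reps=10000, seed=0):
    """Share of simulated sample averages within 1 SE of the expected value."""
    rng = random.Random(seed)
    mu, sigma = 1.0, 1.0           # mean and SD of the exponential(1) distribution
    se = sigma / n ** 0.5          # SE of the average of n draws
    hits = 0
    for _ in range(reps):
        avg = sum(rng.expovariate(1.0) for _ in range(n)) / n
        if abs(avg - mu) <= se:
            hits += 1
    return hits / reps

frac_25 = fraction_within_one_se(25)
frac_100 = fraction_within_one_se(100)
print(f"n = 25: {frac_25:.3f}   n = 100: {frac_100:.3f}   normal curve: 0.683")
```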


Philosophers’ stones in the early twenty-first century 


Correlation, partial correlation, Cross lagged correlation, Principal
components, Factor analysis, OLS, GLS, PLS, IISLS, IIISLS,
IVLS, FIML, LIML, SEM, GLM, HLM, HMM, GMM, ANOVA, 
MANOVA, Meta-analysis, Logits, Probits, Ridits, Tobits, RESET, 
DFITS, AIC, BIC, MAXENT, MDL, VAR, AR, ARIMA, ARFIMA, 
ARCH, GARCH, LISREL, Partial likelihood, Proportional hazards, 
Hinges, Froots, Flogs with median polish, CART, Boosting, Bag- 
ging, MARS, LARS, LASSO, Neural nets, Expert systems, Bayesian 
expert systems, Ignorance priors, WinBUGS, EM, LM, MCMC, 
DAGs, TETRAD, TETRAD II... 


The modelers’ response


We know all that. Nothing is perfect. Linearity has to be a good 
first approximation. Log linearity has to be a good first approxi- 
mation. The assumptions are reasonable. The assumptions don’t 
matter. The assumptions are conservative. You can’t prove the as- 
sumptions are wrong. The biases will cancel. We can model the 
biases. We’re only doing what everybody else does. Now we use 
more sophisticated techniques. If we don’t do it, someone else will. 
What would you do? The decision-maker has to be better off with 
us than without us. We all have mental models. Not using a model 
is still a model. The models aren’t totally useless. You have to do 
the best you can with the data. You have to make assumptions in 
order to make progress. You have to give the models the benefit of 
the doubt. Where’s the harm? 


The difficulties in modeling are not unknown. For example, Hendry 
(1980, p. 390) writes that “Econometricians have found their Philosophers’ 
Stone; it is called regression analysis and is used for transforming data into 
‘significant’ results!” This seriously under-estimates the number of philoso- 
phers’ stones. Hendry’s position is more complicated than the quote might 
suggest. Other responses from the modeling perspective are quite predictable. 


10.2 Critical literature 


For the better part of a century, many scholars in many different disci- 
plines have expressed considerable skepticism about the possibility of disen- 
tangling complex causal processes by means of statistical modeling. Some of 
this critical literature will be reviewed here. The starting point is the exchange 
between Keynes (1939, 1940) and Tinbergen (1940). Tinbergen was one of 
the pioneers of econometric modeling. Keynes expressed blank disbelief 
about the development:


“No one could be more frank, more painstaking, more free from subjective
bias or parti pris than Professor Tinbergen. There is no one,
therefore, so far as human qualities go, whom it would be safer to trust 
with black magic. That there is anyone I would trust with it at the present 
stage, or that this brand of statistical alchemy is ripe to become a branch 
of science, I am not yet persuaded. But Newton, Boyle and Locke all 
played with alchemy. So let him continue.” (Keynes 1940, p. 156) 


Other familiar citations in the economics literature include Liu (1960), 
Lucas (1976), and Sims (1980). Lucas was concerned about parameters 
that changed under intervention. Manski (1995) returns to the problem of
under-identification that was posed so sharply by Liu and Sims: in brief, a 
priori exclusion of variables from causal equations can seldom be justified, so 
there will typically be more parameters than data. Manski suggests methods 
for bounding quantities that cannot be estimated. Sims’ idea was to use low- 
dimensional models for policy analysis, instead of complex high-dimensional 
ones. Leamer (1978) discusses the issues created by inferring specifications 
from the data, as does Hendry (1980). Engle, Hendry, and Richard (1983) 
distinguish several kinds of exogeneity assumptions. 

Heckman (2000) traces the development of econometric thought from 
Haavelmo and Frisch onwards. Potential outcomes and structural parameters 
play a central role, but “the empirical track record of the structural [modeling] 
approach is, at best, mixed” [p. 49]. Instead, the fundamental contributions 
of econometrics are the insights 


“that causality is a property of a model, that many models may explain 
the same data and that assumptions must be made to identify causal or 
structural models. ...” [p. 89] 


Moreover, econometricians have clarified “the possibility of interrelation- 
ships among causes,” as well as “the conditional nature of causal knowledge 
and the impossibility of a purely empirical approach to analyzing causal ques- 
tions” [pp. 89-90]. Heckman concludes that 


“The information in any body of data is usually too weak to eliminate 
competing causal explanations of the same phenomenon. There is no 
mechanical algorithm for producing a set of ‘assumption free’ facts or 
causal estimates based on those facts.” [p. 91] 


Some econometricians have turned to natural experiments for the eval- 
uation of causal theories. These investigators stress the value of strong re- 
search designs, with careful data collection and thorough, context-specific, 
data analysis. Angrist and Krueger (2001) have a useful survey. 

Rational choice theory is a frequently-offered justification for statistical 
modeling in economics and cognate fields. Therefore, any discussion of 
empirical foundations must take into account a remarkable series of papers, 
initiated by Kahneman and Tversky (1974), that explores the limits of rational 
choice theory. These papers are collected in Kahneman, Slovic, and Tversky 
(1982), Kahneman and Tversky (2000). The heuristics-and-biases program of 
Kahneman and Tversky has attracted its own critics (Gigerenzer 1996). The 
critique is interesting, and has some merit. But in the end, the experimental 
evidence demonstrates severe limits to the power of rational choice theory 
(Kahneman and Tversky 1996). 

The data show that if people are trying to maximize expected utility, they
don’t do it very well. Errors are large and repetitive, go in predictable di- 
rections, and fall into recognizable categories. Rather than making decisions 
by optimization—or bounded rationality, or satisficing—people seem to use 
plausible heuristics that can be classified and analyzed. Rational choice the- 
ory is generally not a good basis for justifying empirical models of behavior, 
because it does not describe the way real people make real choices. 

Sen (2002), drawing in part on the work of Kahneman and Tversky, 
gives a far-reaching critique of rational choice theory, with many counter- 
examples to the assumptions. The theory has its place, according to Sen, but 
also leads to “serious descriptive and predictive problems” [p. 23]. Nelson 
and Winter (1982) reached similar conclusions in their study of firms and 
industries. The axioms of orthodox economic theorizing, profit maximization 
and equilibrium, create a “flagrant distortion of reality” [p. 21]. 

Almost from the beginning, there were critiques of modeling in other 
social sciences too. Bernert (1983) and Platt (1996) review the historical 
development in sociology. Abbott (1997) finds that variables like income and 
education are too abstract to have much explanatory power; so do models 
built on those variables. There is a broader examination of causal modeling 
in Abbott (1998). He finds that “an unthinking causalism today pervades our 
journals and limits our research” [p. 150]. He recommends more empha- 
sis on descriptive work and on smaller-scale theories more tightly linked to 
observable facts—middle-range theories, in Robert Merton’s useful phrase. 
Clogg and Haritou (1997) consider difficulties with regression, noting that 
endogenous variables can all too easily be included as regressors. Hedström 
and Swedberg (1998) present a lively collection of essays by a number of so- 
ciologists who are quite skeptical about regression models. Rational choice 
theory also takes its share of criticism. 

Goldthorpe (1999, 2000, 2001) describes several ideas of causation and 
corresponding methods of statistical proof, which have different strengths and 
weaknesses. He is skeptical of regression, but finds rational choice theory to 
be promising—unlike other scholars cited above. He favors use of descrip- 
tive statistics to infer social regularities, and statistical models that reflect 
generative processes. He finds the manipulationist account of causation to be 
generally inadequate for the social sciences. Ni Bhrolcháin (2001) has some 
particularly forceful examples to illustrate the limits of modeling. 

Lieberson (1985) finds that in social science, non-experimental data are 
routinely analyzed as if they had been generated experimentally, the typi- 
cal mode of analysis being a regression model with some control variables. 
This enterprise has “no more merit than a quest for a perpetual-motion
machine” [p. ix]. Finer-grain analytic methods are needed for causal inference,
more closely adapted to the details of the problem at hand. The role of 
counter-factuals is explained (pp. 45-48).

Lieberson and Lynn (2002) are equally skeptical about mimicking ex- 
perimental control through complex statistical models: simple analysis of 
natural experiments would be preferable. Sobel (1998) reviews the literature 
on social stratification, concluding that “the usual modeling strategies are 
in need of serious change” [p. 345]. Also see Sobel (2000). In agreement 
with Lieberson, Berk (2004) doubts the possibility of inferring causation by 
statistical modeling, absent a strong theoretical basis for the models—which 
rarely is to be found. 

Paul Meehl was a leading empirical psychologist. His 1954 book has 
data showing the advantage of using regression, rather than experts, to make 
predictions. On the other hand, his 1978 paper, “Theoretical risks and tabular 
asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology,” saw 
hypothesis tests—and cognate black arts—as stumbling blocks that slowed 
the progress of psychology. Meehl and Waller (2002) discusses the choice 
between two similar path models, viewed as reasonable approximations to 
some underlying causal structure, but does not reach the critical question— 
how to assess the adequacy of the approximations. 


Steiger (2001) provides a critical review of structural equation models. 
Larzalere et al (2004) offer a more general discussion of difficulties with 
causal inference by purely statistical methods. Abelson (1995) has a distinc- 
tive viewpoint on statistics in psychology. There is a well-known book on the 
logic of causal inference, by Cook and Campbell (1979). Also see Shadish, 
Cook, and Campbell (2002), who have among other things a useful discussion 
of manipulationist versus non-manipulationist ideas of causation. 

Pilkey and Pilkey-Jarvis (2006) suggest that quantitative models in the 
environmental and health sciences are highly misleading. Also see Lom- 
borg (2001), who criticizes the Malthusian position. The furor surrounding 
Lomborg’s book makes one thing perfectly clear. Despite the appearance 
of mathematical rigor and the claims to objectivity, results of environmental 
models are often exquisitely tuned to the sensibilities of the modelers. 

In political science, after a careful review of the evidence, Green and 
Shapiro (1994) conclude “despite its enormous and growing prestige in the 
discipline, rational choice theory has yet to deliver on its promise to advance 
the empirical study of politics” [p. 7]. Fearon (1991) discusses the role 
of counter-factuals. Achen (1982, 1986) provides an interesting defense of 
statistical models; Achen (2002) is substantially more skeptical. Dunning 
(2008) focuses on the assumptions behind IVLS. 

King, Keohane, and Verba (1994) are remarkably enthusiastic about
regression. Brady and Collier (2004) respond with a volume of essays that 
compare regression methods to case studies. Invariance—together with the 
assumption that coefficients are constant across cases—is discussed under 
the rubric of causal homogeneity. The introductory chapter (Brady, Collier, 
and Seawright 2004) finds that 


“it is difficult to make causal inferences from observational data, espe- 
cially when research focuses on complex political processes. Behind the 
apparent precision of quantitative findings lie many potential problems 
concerning equivalence of cases, conceptualization and measurement, 
assumptions about the data, and choices about model specification. ...
The interpretability of quantitative findings is strongly constrained by 
the skill with which these problems are addressed.” [pp. 9-10] 


There is a useful discussion in Political Analysis vol. 14, no. 3, summer, 2006. 
Also see George and Bennett (2005), Mahoney and Rueschemeyer (2003). 
The essay by Hall in the latter reference is especially relevant. 

One of the difficulties with regression models is accounting for the ε’s.
Where do they come from, what do they mean, and why do they have the 
required statistical properties? Error terms are often said to represent the 
overall effects of factors omitted from the equation. But this characterization 
has problems of its own, as shown by Pratt and Schlaifer (1984, 1988). 

In Holland (1986, 1988), there is a super-population model—rather than 
individualized error terms—to account for the randomness in causal models. 
However, justifying the super-population model is no easier than justifying 
assumptions about error terms. Stone (1993) presents a super-population 
model with some observed covariates and some unobserved; this paper is 
remarkable for its clarity. 

Recently, strong claims have been made for non-linear methods that 
elicit the model from the data, and control for unobserved confounders, with 
little need for substantive knowledge (Spirtes-Glymour-Scheines 1993, Pearl 
2000). However, the track record is not encouraging (Freedman 1997, 2004; 
Humphreys and Freedman 1996, 1999). There is a free-ranging discussion of 
such issues in McKim and Turner (1997). Other cites to the critical literature 
include Oakes (1990), Diaconis (1998), Freedman (1985, 1987, 1991, 1995, 
1999, 2005). Hoover (2008) is rather critical of the usual econometric models 
for causation, but views non-linear methods as more promising. 

Matching may sometimes be a useful alternative to modeling, but it is 
hardly a universal solvent. In many contexts there will be little difference 
between matching and modeling, especially if the matching is done on the 
basis of statistical models, or data from the matching are subjected to
model-based adjustments. For discussion and examples, see Glazerman, Levy, and
Myers (2003); Arceneaux, Gerber, and Green (2006); Wilde and Hollister 
(2007); Berk and Freedman (2008); Review of Economics and Statistics, 
February (2004) vol. 86, no. 1; Journal of Econometrics, March-April (2005)
vol. 125, nos. 1-2.


10.3 Response schedules 


The response-schedule model is the bridge between regression and cau- 
sation, as discussed in section 6.4. This model was proposed by Ney- 
man (1923). The paper is in Polish, but there is an English translation by 
Dabrowska and Speed in Statistical Science (1990), with discussion. Scheffé 
(1957) gave an expository treatment. The model was rediscovered a number 
of times, and was discussed in elementary textbooks of the 1960s: see Hodges 
and Lehmann (1964, section 9.4). The setup is often called “Rubin’s model:” 
see for instance Holland (1986, 1988), who cites Rubin (1974). That simply 
mistakes the history. 

Neyman’s model covers observational studies—in effect, assuming these 
studies are experiments after suitable controls have been introduced. Indeed, 
Neyman does not require random assignment of treatments, assuming in- 
stead an urn model. The model is non-parametric, with a finite number of 
treatment levels. Response schedules were developed further by Holland and 
Rubin among others, with extensions to real-valued treatment variables and 
parametric models, including linear causal relationships. 

As demonstrated in chapters 6-9, response schedules help clarify the 
process by which causation can be, under some circumstances, inferred by 
running regressions on observational data. The mathematical elegance of 
response schedules should not be permitted to obscure the basic issue. To 
what extent are the assumptions valid, for the applications of interest? 


10.4 Evaluating the models in chapters 7-9 


Chapter 7 discussed a probit model for the effect of Catholic schools 
(Evans and Schwab 1995). Chapter 9 considered a simultaneous-equation 
model for education and fertility (Rindfuss et al 1980), and a linear prob- 
ability model for social capital (Schneider et al 1997). In each case, we 
found serious difficulties. The studies under review are at the high end of 
the social science literature. They were chosen for their strengths, not their 
weaknesses. The problems are not in the studies, but in the modeling technol- 
ogy. More precisely, bad things happen when the technology is applied to real 
problems, without validating the assumptions behind the models. Taking
assumptions for granted is what makes statistical techniques into philosophers’
stones. 


10.5 Summing up 


In the social and behavioral sciences, far-reaching claims are often made 
for the superiority of advanced quantitative methods—by those who manage 
to ignore the far-reaching assumptions behind the models. In section 10.2, we 
saw there was considerable skepticism about disentangling causal processes 
by statistical modeling. Earlier in the book, we examined several well-known 
modeling exercises, and discovered good reasons for skepticism. Some kinds 
of problems may yield to sophisticated statistical technique; others will not. 
The goal of empirical research is—or should be—to increase our understand- 
ing of the phenomena, rather than displaying our mastery of technique. 


Answers to Selected Exercises


Chapter 1 Observational Studies and Experiments 
Exercise Set A 


1. In table 1, there were 837 deaths from other causes in the total treatment 
group (screened plus refused) and 879 in the control group. Not much 
different. 


Comments. (i) Groups are the same size, so we can look at numbers or rates. 
(ii) The difference in number of deaths is relatively small, and not statistically 
significant. 


2. This comparison is biased. The control group includes women who 
would have accepted screening if they had been asked, and are therefore 
comparable to women in the screening group. But the control group 
also includes women who would have refused screening. The latter are 
poorer, less well educated, less at risk from breast cancer. (A comparison 
that includes only the subjects who follow the investigators’ treatment 
plans is called “per protocol analysis,” and is generally biased.) 


3. Natural experiment. The fact that the Lambeth Company moved its pipe 
(i) sets up the comparison with Southwark & Vauxhall (table 2) and 
(ii) makes it harder to explain the difference in death rates between the 
Lambeth customers and the Southwark & Vauxhall customers on the 
basis of some difference between the two groups—other than the water. 
For instance, people were generally not choosing between the two water 
companies on the basis of how the water tasted. If they had been, self- 
selection and confounding would be bigger issues. The change in water 
intake point is one basis for the view that the data could be analyzed as 
if they were from a randomized controlled experiment. 


4. Observational study. Hence the need for adjustment by regression.


5. (i) If $-0.755$, outrelief prevents poverty.

(ii) If $+0.005$, outrelief has no real effect on poverty.

6. (i) $E(S_n) = n\mu$ and $\mathrm{var}(S_n) = n\sigma^2$.

(ii) $E(S_n/n) = \mu$ and $\mathrm{var}(S_n/n) = \sigma^2/n$.

7. (i) $E(S_n) = np$ and $\mathrm{var}(S_n) = np(1 - p)$.

(ii) $E(S_n/n) = p$ and $\mathrm{var}(S_n/n) = p(1 - p)/n$.

NB. For many purposes, variance has the wrong size and the wrong units.
Take the square root of the variance to get the standard error.


8. The law of large numbers says that with a big sample, the sample average 
will be close to the population average. More technically, let $X_1, X_2, \ldots$
be independent and identically distributed with $E(X_i) = \mu$. Then

\[ (X_1 + X_2 + \cdots + X_n)/n \to \mu \]

with probability 1.


9. Reverse causation is plausible: on the days when the joints don’t hurt, 
subjects feel that religious coping worked. 


10. Association is not the same as causation. The big issue is confounding, 
and it is easy to get fooled. On the other hand, association is often a 
good clue. Sometimes, you can make a very tight argument for causation 
based on observational data. See text for discussion and examples. 


Comments. If the material on experiments and observational studies is unfa- 
miliar, you might want to read chapters 1, 2, and 9 in Freedman-Pisani-Purves 
(2007). For more information on intention-to-treat and per-protocol analysis, 
see Freedman (2006b). 


Chapter 2 The Regression Line 
Exercise Set A 


1. (a) False. The son is likely to be shorter: the 50-50 point is
$33.9 + 0.514 \times 72 = 70.9$ inches.
To see this, use the regression line computed in part (b).
(b) The slope is $0.501 \times 2.81/2.74 = 0.514$. The intercept is
$68.7 - 0.514 \times 67.7 = 33.9$ inches.

The RMS error is $\sqrt{1 - 0.501^2} \times 2.81 = 2.43$ inches.
Comment. The SD line says that sons are 1 inch taller than their fathers. 
However, it is the regression line that picks off the centers of the vertical 
strips, not the SD line, and the regression line is flatter than the SD line—the 
“regression effect.” If the material on correlation and regression is unfamiliar, 
you might want to read chapters 8-12 in Freedman-Pisani-Purves (2007).
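
The arithmetic in answer 1 can be reproduced from the five summary statistics. In the sketch below, the labels (fathers on the x-axis, sons on the y-axis) are inferred from the formulas quoted in the answer, and the variable names are made up for illustration.

```python
import math

# Summary statistics as they appear in the answer (heights in inches).
mean_father, sd_father = 67.7, 2.74   # x variable (inferred labeling)
mean_son, sd_son = 68.7, 2.81         # y variable (inferred labeling)
r = 0.501                             # correlation coefficient

slope = r * sd_son / sd_father                 # r * SD_y / SD_x
intercept = mean_son - slope * mean_father     # line passes through the point of averages
rms_error = math.sqrt(1 - r ** 2) * sd_son     # RMS error of the regression line
predicted_son = intercept + slope * 72         # prediction for a 72-inch father

print(f"slope {slope:.3f}, intercept {intercept:.1f}, "
      f"RMS error {rms_error:.2f}, prediction {predicted_son:.1f}")
```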
2. According to the model, if the weight $x_i = 0$, the measured length is
$Y_i = a + \epsilon_i \ne a$. In short, $a$ cannot be determined exactly, due to
measurement error. With ten measurements, the average is $a$, plus the
average of the ten $\epsilon_i$’s. This still isn’t $a$, but it’s closer.

3. If we take $\hat{a} = 439.01$ and $\hat{b} = 0.05$, the residuals are $-0.01$, $0.01$, $0.00$,
$0.00$, $-0.01$, $-0.01$. The RMS error is the better statistic. It is about
0.008 cm. The MSE is 0.00007 cm$^2$. Wrong size, wrong units. (Residuals
don’t add to 0 because $\hat{a}$ and $\hat{b}$ were rounded.)

4. Use $r$ for the left hand scatter plot. The middle one is U-shaped, and
the right hand one has two clouds of points stuck together: $r$ doesn’t
reveal these features of the data. If in doubt, read chapter 8 in
Freedman-Pisani-Purves (2007).
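
The RMS error and MSE quoted for the six residuals can be checked directly. This is a small sketch; the residuals are copied from the answer and the variable names are illustrative.

```python
import math

# Residuals (in cm) as listed in the answer.
residuals = [-0.01, 0.01, 0.00, 0.00, -0.01, -0.01]

mse = sum(e * e for e in residuals) / len(residuals)   # units: cm squared
rms = math.sqrt(mse)                                   # units: cm

print(f"MSE = {mse:.5f} cm^2, RMS error = {rms:.3f} cm")
```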


Exercise Set B, Chapter 2 


1. In equation (1), variance applies to data. So does correlation in (4).

2. These are estimates.

3. The regression line is $y = 439.0100 + 0.0495x$.

4. Data.

5. 35/12 starts life as the variance of the list {1, 2, 3, 4, 5, 6}, which could
be viewed as data. If you pick a number at random from the list, that’s
a random variable, whose variance is 35/12.

6. The expected value is $180 \times 1/6 = 30$, which goes into the first blank.
The variance is $180 \times (1/6) \times (5/6) = 25$. But it is $\sqrt{25} = 5$ that goes
into the second blank.

7. The expected value is $1/6 = 0.167$. The variance is $(1/6) \times (5/6)/250 = 0.000556$.
The SE is $\sqrt{0.000556} = 0.024$. The expected value goes into
the first blank. The SE (not the variance) goes into the second blank.

8. (a) The observed value for the number of 1’s is 17. The expected value
is $100 \times 1/4 = 25$. The SE is $\sqrt{100 \times (1/4) \times (3/4)} = 4.33$. The
observed number of 1’s is 1.85 SEs below expected. Eliminate the
“number of 1’s.”

The observed value for the number of 2’s is 54. The expected value
is $100 \times 1/2 = 50$. The SE is $\sqrt{100 \times (1/2) \times (1/2)} = 5$. The
observed number of 2’s is 0.8 SEs above expected: the “number of
2’s” goes into the blank.

(b) The observed value for the number of 5’s is 29. The expected value
is 25. The SE is 4.33. The observed number of 5’s is 0.92 SEs
above expected. Eliminate the “number of 5’s.”

The observed value for the sum of the draws is $17 + 108 + 145 = 270$.
The average of the box is 2.5; the SD is 1.5. The expected value for
the sum is $100 \times 2.5 = 250$. The SE for the sum is $\sqrt{100} \times 1.5 = 15$.
The observed value is 1.33 SEs above the expected value: the “sum
of the draws” goes into the blank.

If this is unfamiliar ground, you might want to read chapter 17 in
Freedman-Pisani-Purves (2007).
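
The standardized differences in the box-model answer can be recomputed in a few lines. The contents of the box are not stated in the answer; the list `[1, 2, 2, 5]` below is an assumption, chosen because it matches the quoted chances (1/4, 1/2, 1/4), the average of 2.5, and the SD of 1.5.

```python
import math

box = [1, 2, 2, 5]                      # assumed box, consistent with the quoted figures
n_draws = 100
observed_counts = {1: 17, 2: 54, 5: 29}

# Observed count versus expected count, in SE units, for each kind of ticket.
z_count = {}
for ticket, observed in observed_counts.items():
    p = box.count(ticket) / len(box)
    expected = n_draws * p
    se = math.sqrt(n_draws * p * (1 - p))
    z_count[ticket] = (observed - expected) / se

# The same comparison for the sum of the draws.
box_avg = sum(box) / len(box)
box_sd = math.sqrt(sum((t - box_avg) ** 2 for t in box) / len(box))
observed_sum = sum(ticket * count for ticket, count in observed_counts.items())
z_sum = (observed_sum - n_draws * box_avg) / (math.sqrt(n_draws) * box_sd)

print({k: round(v, 2) for k, v in z_count.items()}, round(z_sum, 2))
```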

9. Model.

10. $a$ and $b$ are unobservable parameters; $\epsilon_i$ is an unobservable random
variable; $Y_i$ is an observable random variable.

11. The observed value of a random variable.

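12. (a) $\sum_i (x_i - \bar{x}) = \bigl(\sum_i x_i\bigr) - n\bar{x} = n\bar{x} - n\bar{x} = 0$.

(b) Just square it out:

\[
\begin{aligned}
\sum_{i=1}^n (x_i - c)^2 &= \sum_{i=1}^n \bigl[(x_i - \bar{x}) + (\bar{x} - c)\bigr]^2 \\
&= \sum_{i=1}^n \bigl[(x_i - \bar{x})^2 + (\bar{x} - c)^2 + 2(x_i - \bar{x})(\bar{x} - c)\bigr].
\end{aligned}
\]

But $\sum_{i=1}^n \bigl[2(x_i - \bar{x})(\bar{x} - c)\bigr] = 2(\bar{x} - c) \sum_{i=1}^n (x_i - \bar{x}) = 0$ by (a).
And $\sum_{i=1}^n (\bar{x} - c)^2 = n(\bar{x} - c)^2$.

(c) Use (b): $(\bar{x} - c)^2 \ge 0$, with a minimum in $c$ at $c = \bar{x}$.

(d) Put $c = 0$ in (b).

13. Sample mean: see 12(c).

14. Part (a) follows from equation (4); part (b), from (5). Part (c) follows
from equation (1). For part (d),

\[ (x_i - \bar{x})(y_i - \bar{y}) = x_i y_i - \bar{x} y_i - x_i \bar{y} + \bar{x}\bar{y}. \]

So

\[
\begin{aligned}
\mathrm{cov}(x, y) &= \frac{1}{n} \sum_{i=1}^n (x_i y_i - \bar{x} y_i - x_i \bar{y} + \bar{x}\bar{y}) \\
&= \frac{1}{n} \sum_{i=1}^n x_i y_i - \bar{x}\,\frac{1}{n} \sum_{i=1}^n y_i - \bar{y}\,\frac{1}{n} \sum_{i=1}^n x_i + \bar{x}\bar{y} \\
&= \frac{1}{n} \sum_{i=1}^n x_i y_i - \bar{x}\bar{y}.
\end{aligned}
\]

Part (e). Put $y = x$ in (d) and use (c).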
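
A numerical spot check of the two identities may help: the decomposition of the sum of squares around an arbitrary point c, and the shortcut formula for the covariance. The lists and the value of c below are made-up illustrative data, not taken from the exercises.

```python
# Made-up data for illustration.
x = [2.0, 3.0, 7.0, 8.0]
y = [1.0, 4.0, 5.0, 10.0]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
c = 1.5   # an arbitrary point

# Sum of squares around c = sum of squares around the mean + n * (mean - c)^2.
ss_around_c = sum((xi - c) ** 2 for xi in x)
ss_decomposed = sum((xi - x_bar) ** 2 for xi in x) + n * (x_bar - c) ** 2

# cov(x, y): the definition versus the shortcut "average of products minus product of averages".
cov_definition = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n
cov_shortcut = sum(xi * yi for xi, yi in zip(x, y)) / n - x_bar * y_bar

print(ss_around_c, ss_decomposed)
print(cov_definition, cov_shortcut)
```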


No answers supplied for exercises 15-16. 
17. (a) $P(X_1 = 3 \mid X_1 + X_2 = 8)$ equals

\[ \frac{P(X_1 = 3, X_2 = 5)}{P(X_1 + X_2 = 8)} = \frac{1/36}{5/36} = 1/5. \]

(b) $P(X_1 + X_2 = 7 \mid X_1 = 3) = P(X_2 = 4 \mid X_1 = 3) = 1/6$.

(c) Conditionally, $X_1$ is 1, 2, 3, 4, or 5 with equal probability, so the
conditional expectation is 3.

Generally, $P(A \mid B) = P(A \text{ and } B)/P(B)$. If $X_1$ and $X_2$ are independent,
conditioning on $X_1$ doesn’t change the distribution of $X_2$. Exercise
17 prepares for chapter 4. If the material is unfamiliar, you might wish
to read chapters 13-15 in Freedman-Pisani-Purves (2007).
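
These conditional probabilities can be verified by listing the 36 equally likely outcomes for two dice. The sketch below uses exact fractions. The conditioning event for part (c) is not restated in the answer; the code assumes it is "the sum equals 6", which is the event under which the first die is 1 through 5 with equal probability.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes (first die, second die).
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def cond_prob(event, given):
    # P(A | B) = P(A and B) / P(B)
    return prob(lambda o: event(o) and given(o)) / prob(given)

def cond_expectation(value, given):
    kept = [o for o in outcomes if given(o)]
    return Fraction(sum(value(o) for o in kept), len(kept))

p_a = cond_prob(lambda o: o[0] == 3, lambda o: o[0] + o[1] == 8)
p_b = cond_prob(lambda o: o[0] + o[1] == 7, lambda o: o[0] == 3)
# Assumed conditioning event for part (c): the sum of the two dice is 6.
e_c = cond_expectation(lambda o: o[0], lambda o: o[0] + o[1] == 6)

print(p_a, p_b, e_c)
```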


18. Each term $|x_i - c|$ is continuous in $c$. The sum is too. Suppose
$x_1 < x_2 < \cdots < x_n$ and $n = 2m + 1$ with $m > 0$. (The case $m = 0$ is pretty
easy; do it separately.) The median is $x_{m+1}$. Fix $j$ with $m + 1 \le j < n$.
Let $x_j < c < x_{j+1}$. Now

\[ f(c) = \sum_{i=1}^{j} (c - x_i) + \sum_{i=j+1}^{n} (x_i - c). \]

So $f$ is linear on the open interval $(x_j, x_{j+1})$, with slope $j - (n - j) = 2j - n > 0$.
(The case $c > x_n$ is similar and is omitted.) That is why $f$
increases to the right of the median. The slope increases with $j$, so $f$ is
convex. The argument for the left side of $c$ is omitted.
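
The argument can be illustrated numerically. In this sketch the five data values are made up; the code evaluates the sum of absolute deviations on a grid, confirms that the grid minimum sits at the median, and checks the slope formula 2j - n on one interval to the right of the median.

```python
import statistics

def sum_abs_dev(c, xs):
    """f(c) = sum of |x_i - c|."""
    return sum(abs(x - c) for x in xs)

xs = [1.0, 2.0, 4.0, 7.0, 11.0]       # made-up data, n = 5, already sorted
median = statistics.median(xs)

# Evaluate f on a grid from 0 to 12 in steps of 0.1 and find the minimizer.
grid = [i / 10 for i in range(0, 121)]
best_c = min(grid, key=lambda c: sum_abs_dev(c, xs))

# On the interval (x_4, x_5) = (7, 11), the slope should be 2j - n = 2*4 - 5 = 3.
slope_right = sum_abs_dev(9.0, xs) - sum_abs_dev(8.0, xs)

print(median, best_c, slope_right)
```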


Chapter 3 Matrix Algebra 


Exercise Set A 


1. $r_i$ is $1 \times n$, $c_j$ is $n \times 1$, and $r_i \times c_j$ is the $ij$th element of $A \times B$.


No answers supplied for exercises 2-4. Exercise 2 is one explanation for
non-commutativity: if $f$, $g$ are mappings, seldom will $f(g(x)) = g(f(x))$.

5. $M'M$ [entries illegible in the source] and

\[ MM' = \begin{pmatrix} 10 & 7 & 7 \\ 7 & 5 & 6 \\ 7 & 6 & 17 \end{pmatrix}. \]

Both matrices have the same trace. In fact, $\mathrm{trace}(AB) = \mathrm{trace}(BA)$
when both products are defined, as will be discussed later (exercise B11).


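6. $\|u\| = \sqrt{6} = 2.45$ and $\|v\| = \sqrt{21} = 4.58$. The vectors are not
orthogonal: $u'v = v'u = 1$. The outer product is

\[ uv' = \begin{pmatrix} 1 & 2 & 4 \\ 2 & 4 & 8 \\ -1 & -2 & -4 \end{pmatrix}. \]

The trace is 1. Again, $\mathrm{trace}(uv') = \mathrm{trace}(v'u)$.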
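
The trace identity is easy to check by hand or in code. The vectors below are an assumption: the answer does not restate u and v, so the sketch uses u = (1, 2, -1) and v = (1, 2, 4), which reproduce the lengths and the inner product quoted above.

```python
import math

# Assumed vectors, chosen to match the lengths sqrt(6) and sqrt(21) and u'v = 1.
u = [1, 2, -1]
v = [1, 2, 4]

norm_u = math.sqrt(sum(a * a for a in u))
norm_v = math.sqrt(sum(b * b for b in v))

inner = sum(a * b for a, b in zip(u, v))          # v'u, a 1 x 1 matrix
outer = [[a * b for b in v] for a in u]           # uv', a 3 x 3 matrix
trace_outer = sum(outer[i][i] for i in range(len(u)))

print(round(norm_u, 2), round(norm_v, 2), inner, trace_outer)
for row in outer:
    print(row)
```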


Exercise Set B, Chapter 3 


No answers supplied for exercises 2-8 or 11-13. 


1. The adjoint is

\[ \begin{pmatrix} 2 & 1 & -7 \\ -2 & 1 & 5 \\ 2 & -1 & -1 \end{pmatrix}. \]

9. (a) Choose an $i$ in the range $1, \ldots, m$ and a $k$ in the range $1, \ldots, p$. The
$ik$th element of $MN$ is $q = \sum_j M_{ij} N_{jk}$. So $q$ is the $ki$th element
of $(MN)'$. Also, $q$ is the $ki$th element of $N'M'$.


(b) For the first claim, $MNN^{-1}M^{-1} = M I_{p \times p} M^{-1} = MM^{-1} = I_{p \times p}$,
as required. For the second claim, $(M^{-1})'M' = (MM^{-1})' = I'_{p \times p} = I_{p \times p}$,
as required; use part (a) for the first equality.


10. Let $c$ be $p \times 1$. Suppose $X$ has rank $p$. Why does $X'X$ have rank $p$?
If $X'Xc = 0_{p \times 1}$, then, following the hints, $c'X'Xc = 0 \Rightarrow \|Xc\|^2 = 0
\Rightarrow Xc = 0_{n \times 1} \Rightarrow c = 0_{p \times 1}$ because $X$ has rank $p$. Conversely,
suppose $X'X$ has rank $p$. Why does $X$ have rank $p$? If $Xc = 0_{n \times 1}$, then
$X'Xc = 0_{p \times 1}$, so $c = 0_{p \times 1}$ because $X'X$ has rank $p$.


Review of terminology. Suppose $M$ is $m \times n$. Its columns are linearly
independent provided $Mc = 0_{m \times 1}$ entails $c = 0_{n \times 1}$ for any $n$ vector $c$; similarly
for rows. By way of contrast, the columns of $M$ are linearly dependent when
there is an $n$ vector $c \ne 0_{n \times 1}$ with $Mc = 0_{m \times 1}$. For instance, if the first
column of $M$ vanishes identically, or the first column equals the difference
between the second and third columns, then the columns of $M$ are linearly
dependent. Suppose $X$ is $n \times p$ with $n > p$. Then $X$ has full rank if its rank
is $p$, i.e., the columns of $X$ are linearly independent.

14. Answers are given only for parts (l) and (m).
(l) No: $X$ has row rank $p$, so there is a non-trivial $n \times 1$ vector $c$ with
$c'X = 0_{1 \times p} \Rightarrow X'c = 0_{p \times 1} \Rightarrow XX'c = 0_{n \times 1}$, and $XX'$ isn’t
invertible.
(m) No: $X$ isn’t square because $p < n$. Only square matrices are
invertible.


Comment. In combination, parts (h)-(k) show that $\|Y - X\gamma\|$ is minimized
when $Y - X\gamma \perp X$. This is sometimes called the “projection theorem.”


15. Because $X$ is a column vector, $X'Y = X \cdot Y$ and $X'X = \|X\|^2$; substitute
into the formula for $\hat{\beta}$.


16. Substitute into 15. We’ve derived 2B12(c) from a more general result. 


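17. $f$ is the residual vector when we regress $Y$ on $M$. So $f \perp M$ by 14(g).
Likewise, $g$ is the residual vector when we regress $N$ on $M$, so $g \perp M$.
Next, $e$ is a linear combination of $f$ and $g$, so $e \perp M$. And $e \perp g$: by 15,
$e$ is the residual vector when we regress $f$ on $g$. So $e \perp g + M\hat{\gamma}_2 = N$.
Consequently, $e \perp X = (M \; N)$. We’re almost there:

\[
\begin{aligned}
Y &= M\hat{\gamma}_1 + f = M\hat{\gamma}_1 + g\hat{\gamma}_3 + e \\
&= M\hat{\gamma}_1 + (N - M\hat{\gamma}_2)\hat{\gamma}_3 + e = M(\hat{\gamma}_1 - \hat{\gamma}_2\hat{\gamma}_3) + N\hat{\gamma}_3 + e
\end{aligned}
\]

with $e \perp X$. QED by 14(k).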

Comment. This result is sometimes called the “Frisch-Waugh” theorem by
econometricians. Furthermore, regressing the original vector $Y$ on $g$ has the
same effect as regressing $f$ on $g$, since $M \perp g$.
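
The three-step construction in answer 17 can be tried on a small example. This is a sketch under simplifying assumptions: M is taken to be a single column (here a column of ones) so that each step is a one-variable regression, the data are made up, and the direct regression of Y on (M N) is solved from the 2 x 2 normal equations by Cramer's rule. The names `gamma1`, `gamma2`, `gamma3` are illustrative.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def regress_on_one(y, x):
    """Regress y on the single column x (no extra intercept): coefficient and residuals."""
    coef = dot(x, y) / dot(x, x)
    resid = [yi - coef * xi for yi, xi in zip(y, x)]
    return coef, resid

# Made-up data; M is assumed to be one column for simplicity.
M = [1.0, 1.0, 1.0, 1.0, 1.0]
N = [1.0, 2.0, 4.0, 7.0, 11.0]
Y = [2.0, 3.0, 3.0, 8.0, 9.0]

gamma1, f = regress_on_one(Y, M)   # step 1: Y on M, residual f
gamma2, g = regress_on_one(N, M)   # step 2: N on M, residual g
gamma3, e = regress_on_one(f, g)   # step 3: f on g, residual e

# Direct regression of Y on X = (M N): solve the normal equations X'X b = X'Y.
a, b, d = dot(M, M), dot(M, N), dot(N, N)
p, q = dot(M, Y), dot(N, Y)
det = a * d - b * b
beta_M = (p * d - b * q) / det
beta_N = (a * q - b * p) / det

print(beta_M, gamma1 - gamma2 * gamma3)   # coefficient on M, two ways
print(beta_N, gamma3)                     # coefficient on N, two ways
```

The residual vector e from step 3 is also orthogonal to both M and N, which is the property that identifies it as the residual from the full regression.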


18. The rank is 1, because there is one free column (or row).


Exercise Set C, Chapter 3


No answers supplied for exercises 1-4. 


5. The first assertion follows from the previous exercise, but here is a direct
argument: $c'U - E(c'U) = c'[U - E(U)]$, so

\[
\begin{aligned}
\mathrm{var}(c'U) &= E\{c'[U - E(U)][U - E(U)]'c\} \\
&= c'E\{[U - E(U)][U - E(U)]'\}c.
\end{aligned}
\]

For the second assertion, $U + c - E(U + c) = U - E(U)$, so

\[ [U + c - E(U + c)][U + c - E(U + c)]' = [U - E(U)][U - E(U)]'. \]

Take expectations.
Comment. Exercises 1-5 can be generalized to any number of dimensions. 


6. $\bar{U}$ is a scalar random variable, while $E(U)$ is a fixed $3 \times 1$ vector. The
mean is one thing and expectation is another, although “mean” is often 
used to signify expectation. 
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7. Neither proposition is true: P(E L ¢) = 1 does not imply é II ¢, and 
é Il € does not imply P(é L ¢) = 1. (Notation: JL means indepen- 
dence.) 


Comment. Suppose ¢ has a probability density, so P(¢ € H) = 0 for any 
6-dimensional hyperplane H. If P(€é | ¢) = 1, then & and ¢ cannot be in- 
dependent, because P(g € xt |é = x) = 1, where xt is the 6-dimensional 
hyperplane of vectors orthogonal to x. Conditioning on € changes the distri- 
bution of ¢. 


8. var(ξ) = E{[ξ − E(ξ)]²} and cov(ξ, ζ) = E{[ξ − E(ξ)][ζ − E(ζ)]}.
But E(ξ) = E(ζ) = 0.


9. cov(ξ) = E{[ξ − E(ξ)][ξ − E(ξ)]'}. But E(ξ) = E(ξ') = 0.
10. (a) True. The pairs are identically distributed, and therefore have the
same covariance.
(b) False. cov(ξ_i, ζ_i) is a theoretical quantity, computed from the joint
distribution. By contrast, (1/n) Σ_{i=1}^n (ξ_i − ξ̄)(ζ_i − ζ̄) is the sample
covariance. Comment: when the sample is large, the sample covariance
will be close to the theoretical cov(ξ_i, ζ_i).


11. (i) (1/|σ|) f((x − μ)/σ) for −∞ < x < ∞. If σ = 0 then σX + μ = μ so
the “density” is point mass at μ.
(ii) [f(√x) + f(−√x)]/(2√x) for 0 < x < ∞. If the function f is smooth,
the density in (ii) is f’(O) at x = 0. 


The calculus may be confusing. We’ll go through (i) when σ < 0. Let
Y = σX + μ. Then Y < y if X > y* where y* = −(y − μ)/|σ|.
So P(Y < y) = ∫_{y*}^∞ f(x) dx. Differentiate with respect to y, using the
chain rule. The density of Y at y is |σ|⁻¹ f(y*), and y* = (y − μ)/σ.


Exercise Set D, Chapter 3 


1. The first matrix is positive definite; the second, non-negative definite.
2. For (a), let c be a p × 1 vector. Then c'X'Xc = ||Xc||² ≥ 0, so X'X is
non-negative definite. If c'X'Xc = 0 for c ≠ 0_{p×1}, then Xc = 0_{n×1}
and X is rank deficient: a linear combination of its columns vanishes.
Contradiction. So X'X is positive definite. (Cf. exercise B10.) Part (b)
is similar.
Comment. If p < n, then XX' cannot be positive definite: there is an n × 1
vector c ≠ 0_{n×1} with c'X = 0_{1×p}; then c'XX'c = 0. See exercise B14(1).
3. ||Rx||² = (Rx)'Rx = x'R'Rx = x'x.

4. Let x be n × 1 and x ≠ 0_{n×1}. To show that x'Gx > 0, define y =
R'x. Then y ≠ 0_{n×1}, and x = Ry. Now x'Gx = y'R'GRy =
y'R'RDR'Ry = y'Dy = Σ_{i=1}^n D_ii y_i² > 0.


No answers supplied for exercises 5-6. 


7. Theorem 3.1 shows that G = RDR', where R is orthogonal and D is
a diagonal matrix all of whose diagonal elements are positive. Then
G⁻¹ = RD⁻¹R', G^{1/2} = RD^{1/2}R', and G^{−1/2} = RD^{−1/2}R' are
positive definite: exercises 4-6.

8. Let μ = E(U), a 3 × 1 vector. Then cov(U) = E[(U − μ)(U − μ)'],
a 3 × 3 matrix, call it M. So 0 ≤ var(c'U) = c'Mc and M is
non-negative definite. See exercise 3C5. If there is a 3 × 1 vector c ≠ 0
with c'Mc = 0, then var(c'U) = 0, so c'U = E(c'U) = c'μ with
probability 1.


Exercise Set E, Chapter 3 


1. (a) Define U as in the hint. Then cov(U) = G^{1/2}cov(V)G^{1/2} =
G^{1/2}G^{1/2} = G. See exercise 3C4: G^{1/2} is symmetric.
(b) Try α + G^{1/2}V.


2. Check that E(RU) = 0 and cov(RU) = R cov(U)R' = Rσ²I_{n×n}R' =
σ²RR' = σ²I_{n×n} by exercises 3C3-4. Then use theorem 3.2. (A
more direct proof shows that the density of RU equals the density of
U, because R preserves lengths in Euclidean n-space; the
change-of-variables formula is needed for integrals, in order to push this through.)


3. If ξ and ζ are jointly normal, the first proposition is good; if not, not. The
second proposition is true: if ξ and ζ are independent, their covariance is
0. (In this book, all expectations, variances, covariances ... exist unless
otherwise stated.)


4. Answer omitted.


5. E(ξ + ζ) = α + β. var(ξ + ζ) = var(ξ) + var(ζ) + 2cov(ξ, ζ) =
σ² + τ² + 2ρστ; here, ρ is the correlation between the random variables
ξ, ζ. Any linear combination of jointly normal variables is normal.


6. The expected number of heads is 500. The variance is 1000 × ½ × ½ =
250. The SE is √250 = 15.81. The range 475-525 is −1.58 to 1.58 in
standard units, so the chance is almost the area under the normal curve
between −1.58 and 1.58, which is 0.886.
Comment. The exact chance is 0.893, to three decimals. The normal curve is
an excellent approximation. With the coin and other variables taking integer 
values, if the range is specified as “inclusive” you could add 0.5 at the right 
and subtract 0.5 at the left to get even better accuracy. This is the “continuity 
correction.” See, for instance, chapter 18 in Freedman-Pisani-Purves (2007). 
If e.g. the variable takes only even values, or it takes fractional values, life 
gets more complicated. 
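
The three numbers in this answer (normal approximation, exact chance, continuity correction) can be reproduced with a short calculation. This is a sketch using only the Python standard library; the helper names normal_cdf and normal_area are invented for the illustration.

```python
import math

n, p = 1000, 0.5
mean = n * p
se = math.sqrt(n * p * (1 - p))


def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def normal_area(lo, hi):
    """Normal-curve area between lo and hi, after converting to standard units."""
    return normal_cdf((hi - mean) / se) - normal_cdf((lo - mean) / se)


# Exact binomial chance of 475 to 525 heads, inclusive.
exact = sum(math.comb(n, k) for k in range(475, 526)) / 2 ** n
plain = normal_area(475, 525)            # no continuity correction
corrected = normal_area(474.5, 525.5)    # with continuity correction

print(round(se, 2), round(plain, 3), round(exact, 3), round(corrected, 3))
```

The corrected normal area should land closer to the exact chance than the uncorrected one, as the comment suggests.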


7. p̂ = 102/250 = 0.408, SE = √(0.408 × 0.592/250) = 0.031.
NB. Variance has the wrong size and the wrong units. Take the square
root of the variance to get the SE.


8. (a) 0.031. (b) 0.408 ± 2 × 0.031. That’s what the SE does for a living.
9. σ² = 1/2. If, e.g., x > 0, then P(Z < x) = 0.5 + 0.5Ψ(x/√2).


10. This is a special case of exercise 2. 


Chapter 4 Multiple Regression 
Exercise Set A 


1. (ii) is true by assumption (5); ε ⊥ X is possible, but unlikely.


2. (i) is true by exercise 3B14(g). Since e is computed from X and Y, 
(ii) will be false in general. 


3. No. Unless there’s a bug in the program, e ⊥ X. This has nothing to do
with ε ⫫ X.


Comments on exercises 1-3. In this book, orthogonality (⊥) is about a
pair of vectors, typically deterministic: u ⊥ v if their inner product is 0,
meaning the angle between them is 90°. Independence (⫫) is about random
variables or random vectors: if U ⫫ V, the conditional distribution of V given
U doesn’t depend on U. If U, V are random vectors, then P(U ⊥ V) = 1
often precludes U ⫫ V, because the behavior of V depends on U. In some
probability texts, if W_1 and W_2 are random variables, W_1 ⊥ W_2 means
E(W_1W_2) = 0, and this too is called “orthogonality.”


4. (a) e ⊥ X, so e is orthogonal to the first column in X, which says that
Σ_i e_i = 0.

(b) No. If the computer does the arithmetic right, the sum of the
residuals has to be 0, which says nothing about assumptions behind the
model.

(c) Σ_i ε_i is around σ√n by the central limit theorem (section 3.5).
5. To begin with, ε ⫫ X, so the conditioning is immaterial. For claim (i),
ε'ε = Σ_i ε_i², so E(ε'ε) = Σ_i E(ε_i²). But E(ε_i) = 0. So E(ε_i²) =
var(ε_i) = σ² and E(ε'ε) = nσ². See (4), and exercise 3C8. For
claim (ii), cov(ε) = E(εε') because E(ε_i) = 0, as before. Now εε' is
an n × n matrix, whose ijth element is ε_iε_j. If i ≠ j, then E(ε_iε_j) =
E(ε_i)E(ε_j) = 0 × 0 = 0 by independence. If i = j, then E(ε_i²) =
var(ε_i) = σ²: see above.

6. The second column in the table (lengths) should be the observed values
of Y_i in equation (2.7), for i = 1, 2, ..., 6. Cross-references: equation
(2.7) is equation (7) in chapter 2.

7. Look at equation (1.1) to see that β = (a, b, c, d)', so p = 4. Next, look
at table 1.3. There are 32 lines in the table, so n = 32. There is an
intercept in the equation, so put a column of 1’s as the first column of
the design matrix X. Subtract 100 from each entry in table 1.3. After
the subtraction, columns 2, 3, 4 of the table give you columns 2, 3, 4
in the design matrix X; column 1 of the table gives you the observed
values of Y.


8. The first column in the design matrix is all 1’s, so X_41 = 1. The fourth
union is Chelsea. The second column in X is ΔOut, which also happens
to be the second column in the table. So X_42 = 21 − 100 = −79 and
Y_4 = 64 − 100 = −36. The estimated coefficient b̂ of ΔOut will be the
second entry in β̂ = (X'X)⁻¹X'Y, because b is the second entry in β;
that in turn is because ΔOut is the second thing in equation (1.1), right
after the intercept.


Multidimensional scatter diagrams. Consider the partitioned matrix (X Y) 
which stacks the response variable Y to the right of the design matrix X. If the 
first column of X is a constant, ignore it in what follows. The other columns 
of X, together with Y, define p variables—which correspond to dimensions 
in the scatter diagram. The n rows of (X Y) correspond to data points. In 
section 2.2, there were two variables, so we plotted a two-dimensional scatter 
diagram in figure 2.1. There were n = 1078 data points. Son’s height is 
represented as a (very noisy) function of father’s height. (If all the columns 
of X are variable, well, there are p + 1 variables to worry about.) 


In exercise 7, there are four variables, so the “scatter plot” is four-dimensional.
Three dimensions correspond to the explanatory variables, ΔOut, ΔOld,
ΔPop. The fourth dimension corresponds to the response variable ΔPaup.
The response variable ΔPaup is visualized as a noisy function of ΔOut, ΔOld,
ΔPop. There are n = 32 points in R⁴. For other purposes, it is convenient to
represent the data as 4 points in Rⁿ, with one point for each column of (X Y),
other than the column of 1’s. That is what we did in equations (2.1-4), and
in theorem 4.1. We will do it again many times in this chapter and the next:
n-vectors are convenient for proving theorems.


Mathematicians are fond of “visualizing” things in many dimensions. Maybe 
you get used to it with enough practice. However, for visualizing data, two- 
dimensional scatter diagrams are recommended: e.g., plot Y against each of 
the explanatory variables. Lab 3 below explains other diagnostics. 


Exercise Set B, Chapter 4 
1. True. 


2. (i) True. (ii) Before data collection, Y is a random variable; afterwards, 
it’s the observed value of a random variable. 


3. True. 


4. (i) True. (ii) Before data collection, the sample variance is a random
variable; afterwards, it’s the observed value of a random variable. (In
the exercise, we divided by n; for some purposes, it might be better to
divide by n − 1: most often, it doesn’t matter which divisor you use.)
5. β̂ − β = (X'X)⁻¹X'ε and e = (I − H)ε. See equations (8) and (17).
Condition on X. The joint distribution of (X'X)⁻¹X'ε and (I − H)ε
doesn’t depend on β: there’s no β in the formula.
6. Use formulas (10) and (11). Cf. lab 3 below.
7. Use formulas (10) and (11). Cf. exercise 15 below.


8. Formula (i) is the regression model. It has the parameters β and the
random errors ε.


9. (i) is silly: at least in frequentist statistics, parameters don’t have
covariances. (ii) is true if X is fixed, otherwise, trouble. (iii) is true. (iv) is
false. On the left, given X, we have a fixed quantity. On the right, σ̂² is
still random given X, because σ̂² depends on ε through Y, and ε ⫫ X.
(v) is true.

10. (b) is true. (a) is true if there is an intercept in the equation; or the 
constant vectors are in the column space of X. Generally, however, (a) 
is false. 


11. (a) is silly, because, given X, the right hand side is random and the left
hand side isn’t. (b) is true: Ŷ = Xβ̂, so E(Ŷ|X) = XE(β̂|X) = Xβ.
(c) is true because E(ε|X) = 0.

12. Let H be the hat matrix. Exercise 3B9 shows that H = I. Then
Ŷ = HY = Y. Attention: this will not work if p < n.

13. Let e = Y − Xβ̂ be the residual vector. Now Y = Xβ̂ + e. But ē = 0,
because e is orthogonal to the first column of X. So Ȳ = X̄β̂ + ē = X̄β̂.
When we’re taking averages over rows, β̂ can be viewed as constant:
it’s the same for every row. (If we were talking about expectations, β̂
would not be constant.)

14. (a) var(β̂_1 − β̂_2|X) = var(β̂_1|X) + var(β̂_2|X) − 2cov(β̂_1, β̂_2|X), i.e., the
1,1 element of σ²(X'X)⁻¹, plus the 2,2 element, minus two times
the 1,2 element. This is a very useful fact: see, e.g., section 6.3.
(b) E(c'β̂|X) = c'E(β̂|X) = c'β by theorem 2. Next, var(c'β̂|X) =
c'cov(β̂|X)c = σ²c'(X'X)⁻¹c by exercise 3C4 and theorem 3.

15. The design matrix has a column of 1’s and then a column of X_i’s. Call
this matrix M. It will be convenient to use “bracket notation.” For
instance, ⟨X⟩ = n⁻¹ Σ_i X_i, ⟨XY⟩ = n⁻¹ Σ_i X_iY_i, and so forth. By
exercise 2B14(d)-(e), with var and cov applied to data variables,

⟨X²⟩ = var(X) + ⟨X⟩² and ⟨XY⟩ = cov(X, Y) + ⟨X⟩⟨Y⟩.    (*)

Then

\[ M'M = \begin{pmatrix} n & \sum_i X_i \\ \sum_i X_i & \sum_i X_i^2 \end{pmatrix}
       = n \begin{pmatrix} 1 & \langle X \rangle \\ \langle X \rangle & \langle X^2 \rangle \end{pmatrix} \]

and

\[ M'Y = n \begin{pmatrix} \langle Y \rangle \\ \langle XY \rangle \end{pmatrix}. \]

With the help of equation (*), it is easy to check that

det(M'M) = n²(⟨X²⟩ − ⟨X⟩²) = n²var(X).

So

\[ (M'M)^{-1} = \frac{1}{n\,\mathrm{var}(X)} \begin{pmatrix} \langle X^2 \rangle & -\langle X \rangle \\ -\langle X \rangle & 1 \end{pmatrix} \]

and

\[ (M'M)^{-1}M'Y = \frac{1}{\mathrm{var}(X)} \begin{pmatrix} \langle X^2 \rangle \langle Y \rangle - \langle X \rangle \langle XY \rangle \\ \langle XY \rangle - \langle X \rangle \langle Y \rangle \end{pmatrix}. \]

Now to clean up. The slope is the 2,1 element of (M'M)⁻¹M'Y, which
is

[⟨XY⟩ − ⟨X⟩⟨Y⟩]/var(X).

By (*),
⟨XY⟩ − ⟨X⟩⟨Y⟩ = cov(X, Y).
So
slope = cov(X, Y)/var(X) = r s_Y/s_X
as required. The intercept is the 1,1 element of (M'M)⁻¹M'Y, which is
[⟨X²⟩⟨Y⟩ − ⟨X⟩⟨XY⟩]/var(X).    (**)
Use (*) again to see that
⟨X²⟩⟨Y⟩ − ⟨X⟩⟨XY⟩ = [var(X) + ⟨X⟩²]⟨Y⟩ − ⟨X⟩[cov(X, Y) + ⟨X⟩⟨Y⟩]
                  = var(X)⟨Y⟩ − ⟨X⟩cov(X, Y),
because the terms with ⟨X⟩²⟨Y⟩ cancel. Substitute into (**):
intercept = ⟨Y⟩ − [⟨X⟩cov(X, Y)/var(X)] = ⟨Y⟩ − slope · ⟨X⟩
as required. The variance of the estimated slope is σ² times the 2,2
element of (M'M)⁻¹, namely,
σ²/[n var(X)]
as required. Since ⟨X²⟩ = var(X) + ⟨X⟩² by (*), the 1,1 element of
(M'M)⁻¹ is
⟨X²⟩/[n var(X)] = (1/n)[1 + ⟨X⟩²/var(X)].
The variance of the estimated intercept is obtained on multiplication by
σ², which completes the argument. (The variance of an estimate ...
applies variance to a random variable, not data.)
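
The bracket-notation formulas can be checked against a direct solution of the normal equations. The sketch below uses a small made-up data set (the numbers are not from the book) and only the standard library; variable names such as avg_xx are invented for the illustration.

```python
def mean(values):
    return sum(values) / len(values)


# Made-up data, for illustration only.
X = [1.0, 2.0, 4.0, 7.0, 11.0]
Y = [2.0, 3.0, 3.5, 8.0, 9.0]
n = len(X)

# Bracket notation: averages of data variables.
avg_x, avg_y = mean(X), mean(Y)
avg_xx = mean([x * x for x in X])
avg_xy = mean([x * y for x, y in zip(X, Y)])
var_x = avg_xx - avg_x ** 2        # equation (*), first part
cov_xy = avg_xy - avg_x * avg_y    # equation (*), second part

# Entries of (M'M)^{-1} M'Y from the closed-form 2x2 inverse.
intercept = (avg_xx * avg_y - avg_x * avg_xy) / var_x
slope = (avg_xy - avg_x * avg_y) / var_x

# Cross-check: solve the normal equations M'M beta = M'Y by Cramer's rule.
s_x, s_y = sum(X), sum(Y)
s_xx = sum(x * x for x in X)
s_xy = sum(x * y for x, y in zip(X, Y))
det = n * s_xx - s_x ** 2
intercept_direct = (s_xx * s_y - s_x * s_xy) / det
slope_direct = (n * s_xy - s_x * s_y) / det

# Diagonal of (M'M)^{-1}: multiply by sigma^2 to get the two variances.
inv_11 = avg_xx / (n * var_x)
inv_22 = 1.0 / (n * var_x)

print(slope, cov_xy / var_x)
print(intercept, avg_y - slope * avg_x)
```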


Exercise Set C, Chapter 4 


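1. We know Y = Ŷ + e with e ⊥ X. So e ⊥ Ŷ and therefore (1/n) Σ_i Ŷ_iY_i =
(1/n) Σ_i Ŷ_i². Since Y = Ŷ + e and Σ_i e_i = 0, we also know that (1/n) Σ_i Ŷ_i =
(1/n) Σ_i Y_i. Now use exercise 2B14, where cov and var are applied to data
variables:

\[ \mathrm{cov}(\hat Y, Y) = \Big(\frac{1}{n}\sum_i \hat Y_i Y_i\Big) - \Big(\frac{1}{n}\sum_i \hat Y_i\Big)\Big(\frac{1}{n}\sum_i Y_i\Big)
   = \Big(\frac{1}{n}\sum_i \hat Y_i^2\Big) - \Big(\frac{1}{n}\sum_i \hat Y_i\Big)^2 = \mathrm{var}(\hat Y). \]

The squared correlation coefficient between Ŷ and Y is

\[ \frac{\mathrm{cov}(\hat Y, Y)^2}{\mathrm{var}(\hat Y)\,\mathrm{var}(Y)} = \frac{\mathrm{var}(\hat Y)^2}{\mathrm{var}(\hat Y)\,\mathrm{var}(Y)} = \frac{\mathrm{var}(\hat Y)}{\mathrm{var}(Y)} = R^2. \]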
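
A quick numerical check of this identity, as a sketch: the data are made up, the regression has an intercept and one explanatory variable, and var and cov are the data versions (divide by n). Helper names are invented for the example.

```python
def mean(values):
    return sum(values) / len(values)


def var(values):
    m = mean(values)
    return mean([(v - m) ** 2 for v in values])


def cov(u, v):
    mu, mv = mean(u), mean(v)
    return mean([(a - mu) * (b - mv) for a, b in zip(u, v)])


# Made-up data, for illustration only.
X = [1.0, 2.0, 4.0, 7.0, 11.0]
Y = [2.0, 3.0, 3.5, 8.0, 9.0]

slope = cov(X, Y) / var(X)
intercept = mean(Y) - slope * mean(X)
fitted = [intercept + slope * x for x in X]

r_squared = var(fitted) / var(Y)
corr_squared = cov(fitted, Y) ** 2 / (var(fitted) * var(Y))

print(r_squared, corr_squared)
```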


Discussion questions, Chapter 4 


1. The random errors are independent from one subject to another; residuals
are dependent. Random errors are independent of the design matrix X:
residuals are dependent on X. Residuals are orthogonal to X, random
errors are going to project into X, at least by a little.

For instance, suppose there is an intercept in the equation, i.e., a column
of 1’s in X. The sum of the residuals is 0: that creates dependence
across subjects. The sum of the random errors will not be 0 exactly:
that’s non-orthogonality. Since the residuals have to be orthogonal to
X, they can’t generally be independent of X.

2. If there is an intercept in the equation, the sum of the residuals has to
be 0; or if the column space of the design matrix includes the constant
vectors. Otherwise, the sum of the residuals will usually differ from 0.
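
The contrast between residuals and random errors can be seen in a small simulation. This is a sketch with simulated data; the true intercept 5 and slope 2 are arbitrary choices made for the illustration.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


random.seed(1)
X = [float(i) for i in range(10)]
errors = [random.gauss(0, 1) for _ in X]          # unobservable random errors
Y = [5.0 + 2.0 * x + eps for x, eps in zip(X, errors)]
n = len(X)

# OLS with an intercept.
mean_x, mean_y = sum(X) / n, sum(Y) / n
xc = [x - mean_x for x in X]
yc = [y - mean_y for y in Y]
slope = dot(xc, yc) / dot(xc, xc)
intercept = mean_y - slope * mean_x
resid = [y - intercept - slope * x for x, y in zip(X, Y)]

# OLS through the origin: no column of 1's in the design matrix.
slope0 = dot(X, Y) / dot(X, X)
resid0 = [y - slope0 * x for x, y in zip(X, Y)]

print(sum(resid), dot(resid, X))     # both 0 up to rounding
print(sum(errors))                   # not 0
print(sum(resid0), dot(resid0, X))   # sum is not 0; still orthogonal to X
```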


3. (a) is false and (b) is true (section 4.4). Terminology: in the regression
model Y = Xβ + ε, the disturbance term for the ith subject is ε_i.


4. Conditionally on X, the Y_i are independent but not identically
distributed. For instance, E(Y_i|X) = X_iβ differs from one i to another.
Unconditionally, if the rows of X are IID, so are the Y_i; if the rows of X
are dependent and differently distributed, so are the Y_i.


5. All the assertions are true, assuming the original equation is OK. Take
part (c), for instance. The OLS assumptions would still hold, the true
coefficient of the extra variable being 0. (We’re tacitly assuming that
the new design matrix would still have full rank.)


6. The computer can find β̂ all right, but what is β̂ estimating? And what
do the standard errors mean? (The answers might well be, nothing.)
7. R² measures goodness of fit. It does not measure validity. See text for
discussion and examples.

8. If r = ±1, then column 2 = c × column 1 + d. Since the columns have
mean 0 and variance 1, c = ±1 and d = 0, so the rank is 1. Suppose
|r| < 1. Let M = [u, v], i.e., column 1 is u and column 2 is v. Then

\[ M'M = n \begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix}, \qquad \det(M'M) = n^2(1 - r^2), \qquad
   (M'M)^{-1} = \frac{1}{n}\,\frac{1}{1 - r^2} \begin{pmatrix} 1 & -r \\ -r & 1 \end{pmatrix}. \]

Here, we know σ² = 1. So

\[ \mathrm{var}(\hat a) = \mathrm{var}(\hat b) = \frac{1}{n}\,\frac{1}{1 - r^2}, \]
\[ \mathrm{var}(\hat a - \hat b) = \frac{1}{n}\,\frac{2(1 + r)}{1 - r^2} = \frac{1}{n}\,\frac{2}{1 - r}, \]
\[ \mathrm{var}(\hat a + \hat b) = \frac{1}{n}\,\frac{2(1 - r)}{1 - r^2} = \frac{1}{n}\,\frac{2}{1 + r}. \]

See exercise 4B14(a). If r is close to 1, then var(â + b̂) is reasonable,
but the others are ridiculously large, especially var(â − b̂). When
collinearity is high, you cannot separate the effects of the variables.


Comment. If r is close to −1, then a − b is the parameter that can be
reasonably well estimated. When there are several explanatory variables, the
issue is the multiple R² between each variable and all the others. If one of
these R²’s is high, we have collinearity problems.
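
To get a feel for the sizes involved, the variances can be evaluated for particular values. The sketch below assumes σ² = 1 and standardized columns, as in the answer; the values n = 100 and r = 0.99, and the function name var_of_combination, are choices made for this illustration.

```python
def var_of_combination(c1, c2, n, r):
    """Variance of c1*a_hat + c2*b_hat when sigma^2 = 1 and
    (M'M)^{-1} = [[1, -r], [-r, 1]] / (n * (1 - r^2))."""
    scale = 1.0 / (n * (1.0 - r * r))
    return scale * (c1 * c1 - 2.0 * r * c1 * c2 + c2 * c2)


n, r = 100, 0.99
var_a = var_of_combination(1, 0, n, r)       # var(a_hat) = var(b_hat)
var_diff = var_of_combination(1, -1, n, r)   # var(a_hat - b_hat)
var_sum = var_of_combination(1, 1, n, r)     # var(a_hat + b_hat)

print(var_a, var_diff, var_sum)
```

With these values, the sum is estimated far more precisely than either coefficient or the difference.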


9. (a) and (b) are false, (c) is true. 

10. Let’s set up the design matrix with one column for X, another for W, 
and no column of 1’s, i.e., no intercept. We’ll have a row for each 
observation. Then the OLS assumptions are satisfied. That is why no 
intercept is needed. If X and W are perfectly correlated, the computer 
will complain: the design matrix only has rank 1. See question 8. 


Terminology. “Fitting a regression equation,” “fitting a model,” and “running
a regression” are (slightly) colorful synonyms for computing OLS estimates.
A fitted regression equation is y = xβ̂, where y is scalar, and x is a 1 × p
row vector. This expresses y as a linear function of x.


11. If W_i is independent of X_i, dropping it from the equation creates no bias,
but will probably increase the sampling error: the new disturbance term
is W_ib + ε_i, with larger variance than the old one. If W_i and X_i are
dependent, Tom’s estimate is subject to omitted-variable bias, because
the disturbance term W_ib + ε_i is correlated with X_i.
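
A simulation makes the two cases concrete. This is a sketch under assumptions that are not in the exercise: X_i and the errors are standard normal, W_i = link × X_i + noise, and the parameter values a = 1, b = 3 are arbitrary. With these choices, E(X_iW_i)/E(X_i²) = link, so the large-sample bias of Tom’s estimator is b × link.

```python
import random


def tom_estimate(X, Y):
    """Regression of Y on X alone, no intercept: X.Y / ||X||^2."""
    return sum(x * y for x, y in zip(X, Y)) / sum(x * x for x in X)


def simulate(n, a, b, link, seed):
    """Generate Y_i = a X_i + b W_i + eps_i with W_i = link * X_i + noise_i,
    then return Tom's estimate of a."""
    rng = random.Random(seed)
    X = [rng.gauss(0, 1) for _ in range(n)]
    W = [link * x + rng.gauss(0, 1) for x in X]
    Y = [a * x + b * w + rng.gauss(0, 1) for x, w in zip(X, W)]
    return tom_estimate(X, Y)


a, b, n = 1.0, 3.0, 20000
independent = simulate(n, a, b, link=0.0, seed=0)   # W independent of X
dependent = simulate(n, a, b, link=0.5, seed=0)     # W correlated with X

print(independent)   # close to a = 1
print(dependent)     # close to a + b * 0.5 = 2.5
```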


Here are details on bias. Write X for the vector whose ith coordinate is
X_i; likewise for Y and W. From exercise 3B15, Tom’s estimator will be
â = X·Y/||X||². Now X·Y = a||X||² + bX·W + X·ε. So

â − a = bX·W/||X||² + X·ε/||X||².

By the law of large numbers, X·W ≈ nE(X_iW_i), and ||X||² ≈ nE(X_i²).
By the central limit theorem, X·ε will be something like √(nE(X_i²)E(ε_i²))
in size. With a large sample, X·ε/||X||² ≈ 0. Tom is left with
omitted-variables bias that amounts to bE(X_iW_i)/E(X_i²): his regression of Y
on X picks up the effect of the omitted variable W.


12. See the answer to 10.
13. See the answer to 10.


14. The assertions about the limiting behavior of Q'Q and Q'Y follow from
the law of large numbers. For example, the 2,1 element in Q'Y/n is
(1/n) Σ_{i=1}^n W_iY_i → E(W_iY_i). Write L and M for the limiting matrices, so
Q'Q/n → L and Q'Y/n → M. Check that

\[ L = \begin{pmatrix} 1 & c \\ c & c^2 + d^2 + e^2\sigma^2 \end{pmatrix} \quad\text{and}\quad
   M = \begin{pmatrix} a \\ ac + e\sigma^2 \end{pmatrix}. \]

For example, M_21 = ac + eσ² because

E(W_iY_i) = E[(cX_i + dδ_i + eε_i)(aX_i + ε_i)] = ac + eσ².

Now det L = d² + e²σ²,

\[ L^{-1} = \frac{1}{d^2 + e^2\sigma^2} \begin{pmatrix} c^2 + d^2 + e^2\sigma^2 & -c \\ -c & 1 \end{pmatrix}, \qquad
   L^{-1}M = \begin{pmatrix} a - \dfrac{ce\sigma^2}{d^2 + e^2\sigma^2} \\[2ex] \dfrac{e\sigma^2}{d^2 + e^2\sigma^2} \end{pmatrix}. \]

When Dick includes a variable that is correlated with the error term, his
estimator will have endogeneity bias, which is −ceσ²/(d² + e²σ²) in
this example.
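
The matrix algebra can be verified for particular parameter values. The sketch below assumes, as the formulas for the limiting matrices imply, that E(X_i²) = E(δ_i²) = 1; the numerical values of a, c, d, e, and σ² are arbitrary choices made for the illustration.

```python
a, c, d, e, sigma2 = 1.0, 0.5, 1.0, 2.0, 1.5

# Limiting matrices, assuming E(X_i^2) = E(delta_i^2) = 1.
L = [[1.0, c], [c, c * c + d * d + e * e * sigma2]]
M = [a, a * c + e * sigma2]

# Solve L * coef = M with the 2x2 inverse.
det_L = L[0][0] * L[1][1] - L[0][1] * L[1][0]
coef_X = (L[1][1] * M[0] - L[0][1] * M[1]) / det_L
coef_W = (L[0][0] * M[1] - L[1][0] * M[0]) / det_L

bias = coef_X - a
print(det_L, coef_X, coef_W, bias)
```

The limiting coefficient of X differs from a by the endogeneity bias, and the limiting coefficient of W is not 0 even though W does not appear in the equation for Y.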

Exogenous variables are independent of error terms; endogenous vari- 
ables are dependent on error terms. Putting endogenous variables into 
regression equations is bad. It’s often quite hard to tell whether a variable 
is endogenous or exogenous, so putting extra variables into the equation 
is risky. Chapter 9 discusses techniques for handling endogeneity, but 
these depend on having a stockpile of variables known to be exogenous. 


For part (c), putting another variable into the equation likely reduces the 
sampling error in the estimates, and guards against omitted-variables 
bias. On the other hand, if you do put in that extra variable, endogeneity 
bias is a possibility, and collinearity may be more of a problem. 
15. Bias is likely and the standard errors are not trustworthy. Let {X_i, Y_i :
i = 1, ..., n} be the sample. Nothing says that E(Y_i|X_i) = a + bX_i.
For instance, suppose that in the population, y; = x;°. 

Comment. The bias will be small when n is large, by the law of large numbers.
Even then, don’t trust the standard errors: see (10) for the reason.


16. Lung cancer rates were going up for one reason, the population was 
going up for another reason. This is association, not causation. 


Comment. Lung cancer death rates for men increased rapidly from 1950 to 
1990 and have been coming down since then; cigarette smoking peaked in 
the 1960s. Women started smoking later and stopped later, so their death 
rates peaked about 10 years after the men. The population was increasing 
steadily. 
(: Maybe crowding affects women more than men :) 
17. Answer omitted. 
18. (i) is the model and (ii) is the fitted equation; b is the parameter and 0.755
is the estimate; ε_i is an unobservable error term and e_i is an observable
residual.


Comment. In (i), the “mean” of ε_i is its expected value, E(ε_i) = 0; the
“variance” is E(ε_i²) = σ²: we’re talking about random variables. In (ii), the
mean of the e_i’s is (1/n) Σ_{i=1}^n e_i = 0 and the variance would be
(1/n) Σ_{i=1}^n e_i²: we’re talking about data. Ambiguity is resolved by paying
attention to context.
19. The sample mean (iv) is an unbiased estimator for E (X1). 
20. The statements are true, except for (c) and (e). 


21. Analysis by treatment received can be severely biased, if the men who 
accept screening are different from the ones who decline. Analysis by 
intention to treat is the way to go (section 1.2). 


Comment. Data in the paper can be used to do the intention-to-treat analysis 
(see table below). Screening has no effect on the death rate. Apparently, the 
kind of men who accept screening are at lower risk from the disease than 
those who refuse, as noted by Ruffin (1999). The US Preventive Services 
Task Force (2002) recommends against routine PSA screening. 


                     Invitation Group               Control Group
                Number of           Death     Number of           Death
                   men     deaths    rate        men     deaths    rate

Screened           7348       10      14         1122        1       9
Not screened      23785      143      60        14231       74      52
Total             31133      153      49        15353       75      49


Data from figure 4 in Labrie et al (2004); deaths due to prostate cancer. 
22. In table 1.1, the rate for refusers is lower than for controls.
Chapter 5 Multiple Regression: Special Topics 


Exercise Set A 


Answers are omitted. 


Exercise Set B, Chapter 5 


1. Let c be p × 1 with G^{−1/2}Xc = 0. Then Xc = 0 by exercise 3D7; and
c = 0 because X has full rank. Consequently, G^{−1/2}X has full rank,
and so does X'G⁻¹X by exercise 3B10.


Answers to 2 and 3 are omitted.


Exercise Set C, Chapter 5 


1. To set this up in the GLS framework, stack the U’s on top of the V’s:

Y = (U_1, ..., U_m, V_1, ..., V_n)'.

The design matrix X is an (m + n) × 1 column vector of 1’s. The random
error vector ε is (m + n) × 1, as in the hint. The parameter α is scalar. The
matrix equation is Y = Xα + ε. Condition (2) does not hold, because
σ² ≠ τ². Condition (7) holds. The (m + n) × (m + n) matrix G vanishes
off the diagonal. Along the diagonal, the first m terms are all σ². The
last n terms are all τ². So G⁻¹ is also a diagonal matrix. The first m
terms on the diagonal are 1/σ² while the last n terms are 1/τ². Check
that

X'G⁻¹Y = (1/σ²) Σ_{i=1}^m U_i + (1/τ²) Σ_{j=1}^n V_j,

a scalar; and

X'G⁻¹X = m/σ² + n/τ²,

another scalar. Use (10):

\[ \hat\alpha_{GLS} = \frac{\dfrac{1}{\sigma^2}\sum_{i=1}^m U_i + \dfrac{1}{\tau^2}\sum_{j=1}^n V_j}{\dfrac{m}{\sigma^2} + \dfrac{n}{\tau^2}}. \]

This is real GLS not feasible GLS: the covariance matrix G is given,
rather than estimated from data. Notice that α̂_GLS is a weighted average:
observations with bigger variances get less weight.


If σ² and τ² are unknown, they can be estimated by the sample
variances; the estimates are plugged into the formula above. Now we have
feasible GLS not real GLS. (This is actually one-step GLS, and iteration
is possible.)
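
The weighted-average form of the GLS estimator can be illustrated with a few numbers. In this sketch the observations and the two variances are made up, and the variances are treated as known (real GLS, not feasible GLS).

```python
# Made-up data: the U's are precise measurements, the V's are noisy ones.
U = [9.8, 10.4, 10.1, 9.9]               # m observations, variance sigma2
V = [12.0, 14.5, 11.0, 15.0, 13.5, 9.0]  # n observations, variance tau2
sigma2, tau2 = 0.1, 4.0
m, n = len(U), len(V)

xgy = sum(U) / sigma2 + sum(V) / tau2    # X'G^{-1}Y
xgx = m / sigma2 + n / tau2              # X'G^{-1}X
alpha_gls = xgy / xgx
alpha_ols = (sum(U) + sum(V)) / (m + n)  # OLS: the pooled sample mean

# The same GLS estimate, written as a weighted average of the two sample means.
w_u, w_v = m / sigma2, n / tau2
weighted_avg = (w_u * sum(U) / m + w_v * sum(V) / n) / (w_u + w_v)

print(alpha_ols, alpha_gls, weighted_avg)
```

The GLS estimate sits much closer to the mean of the U’s than the OLS estimate does, because the U’s have the smaller variance.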


2. Part (a) is just theorem 4.1: the OLS estimator minimizes the sum of
squared errors. Part (b) follows from (9): G_ii = λc_i, and the
off-diagonal elements vanish. So, the ith coordinate of G^{−1/2}Y is Y_i/√(λc_i).
The ith row of G^{−1/2}Xγ is X_iγ/√(λc_i). GLS is OLS, on the transformed
model. To find the GLS estimate, you need to find the γ that minimizes
Σ_i [(Y_i − X_iγ)/√(λc_i)]² = Σ_i (Y_i − X_iγ)²/(λc_i); compare (9). Part (c)
is similar; this is like example 1, with Γ_ii = c_i and Γ_ij = 0 when i ≠ j.


3. To set this up in the GLS framework, let Y stack the Y_ij’s. Put the 3
observations for subject #1 on top; then the 3 observations for #2; ...;
at the bottom, the 3 observations for #800. It will be a little easier to
follow the math if we write Y_{i,j} instead of Y_ij:

Y = (Y_{1,1}, Y_{1,2}, Y_{1,3}, Y_{2,1}, Y_{2,2}, Y_{2,3}, ..., Y_{800,1}, Y_{800,2}, Y_{800,3})'.

This Y is 2400 × 1. Next, the parameter vector β stacks up the 800 fixed
effects a_i, followed by the parameter b:

β = (a_1, a_2, ..., a_800, b)'.

This β is 801 × 1. The design matrix has to be 2400 × 801. The first
800 columns have dummy variables for each subject. A dummy variable
is 0 or 1. Column 1, for instance, is a dummy variable for subject #1.
Column 1 equals 1 for subject #1, and is 0 for all other subjects. This
column stacks 3 ones on top of 2397 zeros:

(1, 1, 1, 0, 0, 0, ..., 0, 0, 0)'.

Column 2 equals 1 for subject #2, and is 0 for all other subjects. This
column stacks 3 zeros, then 3 ones, then 2394 zeros:

(0, 0, 0, 1, 1, 1, 0, ..., 0)'.

And so forth. Column 800 equals 1 for subject #800, and is 0 for all
other subjects. Column 800 stacks 2397 zeros on top of 3 ones:

(0, 0, 0, ..., 0, 0, 0, 1, 1, 1)'.

The 801st (and last) column in the design matrix stacks up the Z_ij:

(Z_{1,1}, Z_{1,2}, Z_{1,3}, Z_{2,1}, Z_{2,2}, Z_{2,3}, ..., Z_{800,1}, Z_{800,2}, Z_{800,3})'.

Let’s call this design matrix X (surprise). Here’s what X looks like when
you put the pieces together:

\[ X = \begin{pmatrix}
1 & 0 & \cdots & 0 & Z_{1,1} \\
1 & 0 & \cdots & 0 & Z_{1,2} \\
1 & 0 & \cdots & 0 & Z_{1,3} \\
0 & 1 & \cdots & 0 & Z_{2,1} \\
0 & 1 & \cdots & 0 & Z_{2,2} \\
0 & 1 & \cdots & 0 & Z_{2,3} \\
\vdots & \vdots & & \vdots & \vdots \\
0 & 0 & \cdots & 1 & Z_{800,1} \\
0 & 0 & \cdots & 1 & Z_{800,2} \\
0 & 0 & \cdots & 1 & Z_{800,3}
\end{pmatrix} \]

The matrix equation is Y = Xβ + ε. The dummy variables work with
the fixed effects a_i and get them into the equation.
Assumption (2) isn’t satisfied, because different subjects have different
variances. But (7) is OK. The 2400 × 2400 covariance matrix G is
diagonal. The first 3 elements on the diagonal are all σ_1², corresponding
to subject #1. The next 3 are all σ_2², corresponding to subject #2. And
so forth. If we knew the σ’s, we could use GLS. But we don’t. Instead,
we use feasible GLS.


(i) Fit the model by OLS and get residuals e.

(ii) Estimate the σ_i². For instance, σ̂_1² = (e_1² + e_2² + e_3²)/2, σ̂_2² =
(e_4² + e_5² + e_6²)/2, ..., σ̂_800² = (e_2398² + e_2399² + e_2400²)/2. It’s better
to use 2 as the divisor, rather than 3, if you plan to get SEs from
(14); for estimation, the divisor doesn’t matter.

(iii) String the σ̂_i² down the diagonal to get Ĝ.

(iv) Use (13) to get the one-step GLS estimator.

(v) Iterate if desired.
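
Steps (i)-(iv) can be sketched in code. The version below is an illustration under several assumptions that are not in the exercise: the data are simulated, there are 200 subjects rather than 800, and the 801-column design matrix is never built. Instead, the code relies on the fact that, with a dummy variable for each subject and weights that are constant within a subject, the least-squares estimate of b can be computed from within-subject deviations of Y and Z. The names slope and variance_estimates are invented for the example.

```python
import random

rng = random.Random(0)
num_subjects, b_true = 200, 2.0

# Simulated panel: 3 observations per subject, subject-specific error SD.
data = []
for _ in range(num_subjects):
    a_i = rng.gauss(0, 5)                 # fixed effect for this subject
    sd_i = rng.uniform(0.5, 3.0)          # error SD for this subject
    z = [rng.gauss(0, 1) for _ in range(3)]
    y = [a_i + b_true * zj + rng.gauss(0, sd_i) for zj in z]
    data.append((z, y))


def centered(values):
    m = sum(values) / len(values)
    return [v - m for v in values]


def slope(weights):
    """Weighted within-subject estimate of b; one weight per subject."""
    num = den = 0.0
    for w, (z, y) in zip(weights, data):
        zc, yc = centered(z), centered(y)
        num += w * sum(p * q for p, q in zip(zc, yc))
        den += w * sum(p * p for p in zc)
    return num / den


def variance_estimates(b):
    """Step (ii): sum of the 3 squared residuals for each subject, divided by 2."""
    out = []
    for z, y in data:
        zc, yc = centered(z), centered(y)
        out.append(sum((q - b * p) ** 2 for p, q in zip(zc, yc)) / 2.0)
    return out


# (i) OLS: every subject gets the same weight.
b_ols = slope([1.0] * num_subjects)
# (ii)-(iii) Estimated variances play the role of the diagonal of G-hat.
sigma2_hat = variance_estimates(b_ols)
# (iv) One-step feasible GLS: weight each subject by 1 / estimated variance.
b_fgls = slope([1.0 / s2 for s2 in sigma2_hat])

print(b_ols, b_fgls)
```

Step (v) would repeat (ii)-(iv) starting from b_fgls. With only 3 observations per subject the variance estimates are very noisy, so the sketch stops after one step.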


Exercise Set D, Chapter 5 


1. β̂ is the sample mean and σ̂² is the sample variance, where you divide
by n − 1 rather than n. By theorem 2, the sample mean and sample
variance are independent, β̂ − β is N(0, σ²/n), and σ̂² is distributed
as σ²χ²_{n−1}/(n − 1). Finally, √n(β̂ − β)/σ̂ is t with n − 1 degrees of
freedom.


Comments. (i) SÊ of β̂ is σ̂/√n. (ii) The joint distribution of β̂ − β and σ̂²
doesn’t depend on β, by exercise 4B5. (iii) When p = 1 and the design matrix
is just a column of 1’s, theorem 2 gives the joint distribution for the sample
mean and variance of X_1, ..., X_n, the X_i being IID normal variables: a result
of R. A. Fisher’s. (iv) The result doesn’t hold without normality. However,
for β̂ and t, the central limit theorem comes to the rescue when the sample is
reasonably large. The distribution of σ̂² generally depends on fourth moments
of the parent distribution.

2. (a) True: 3.79/1.88 = 2.02. 


(b) True: P < 0.05. If you want to compute P, see page 309. 

(c) False: P > 0.01. 

(d) This is silly. In frequentist statistics, probability applies to random
variables not parameters. Either b = 0 or b ≠ 0.

(e) Like (d).

(f) True. This is what the P-value means. Contrast with (e).

(g) True. This is what the P-value means. Contrast with (d).

(h) This is silly. Like (d). Confidence intervals are for a different game.

(i) False: see (j)-(k).

(j) True. You need the model to justify the probability calculations.

(k) True. The test assumes the model Y_i = a + bX_i + Z_iγ + ε_i, with
all the conditions on the ε_i’s. The test only asks whether b = 0 or
b ≠ 0. The hypothesis that b = 0 doesn’t fit the data as well as the
other hypothesis, b ≠ 0.


If exercise 2 covers unfamiliar ground, read chapters 26-29 in Freedman- 
Pisani-Purves (2007); confidence intervals are discussed in chapters 21 and 23. 


3. The philosopher is a little mixed up. The null hypothesis must involve
a statement about a model. Commonly, the null restricts a parameter in
the model. For example, here is a model for the philosopher’s coin. The
tosses of the coin are independent. In the first 5000 tosses, the coin lands
heads on each toss with probability p_1. In the last 5000 tosses, the coin
lands heads on each toss with probability p_2. Null: p_1 = p_2. (That’s
the restriction.) Alternative: p_1 ≠ p_2. Here, p_1 and p_2 are parameters,
not relative frequencies in the data. The data are used to test the null,
not to formulate the null.


If |p̂_1 − p̂_2| is larger than what can reasonably be explained by “random
fluctuations,” we reject the null. The estimates p̂_1, p̂_2 are relative
frequencies in the data, not parameters. The philosopher didn’t pick up
the distinction between parameters and estimates.


Comment. The null is about a model, or the connection between data and a 
model. See chapters 26 and 29 in Freedman-Pisani-Purves (2007). 


4. Both statements are false. It’s pretty safe to conclude that β_2 ≠ 0, but
if you want to know how big it is, or how big β̂_2 is, look at β̂_2. The
significance level P is small because t = β̂_2/SÊ is big. That could be
because β̂_2 is big, or because SÊ is small (or both). For more discussion,
see chapter 29 in Freedman-Pisani-Purves (2007).


Exercise Set E, Chapter 5 


1. Use the t-test. This is a regression problem with p = 1. The design
matrix is a column of 1’s. See exercise 5D1. (The F-test is OK too:
F = t², with 1 and n − 1 degrees of freedom.)

2. This is exercise 1, in disguise: δ_i = U_i − α.

3. For small n, don’t use the t-test unless errors are normal. With reasonably
large n, the central limit theorem will take care of things.


Comment. Without normality, if n is small, you might consider using “non- 
parametric methods.” See Lehmann (2006). 
4. Use the F-test with n = 32, p = 4, p_0 = 2. (See example 3, where
p = 5.) The errors would need to be IID with mean 0 and finite variance.
Normality would help but is not essential.

5. ||Xβ̂||² + ||e||² = ||Y||² = ||Xβ̂^(s)||² + ||e^(s)||².

6. Georgia’s null hypothesis has p_0 = p − 1: all coefficients vanish but
the intercept. Her β̂^(s) consists of Ȳ stacked on top of p − 1 zeros. In
the numerator of the F-statistic,

||Xβ̂||² − ||Xβ̂^(s)||² = ||Xβ̂||² − nȲ² = n var(Xβ̂):

exercise 4B13. But var(Xβ̂) = R²var(Y) and var(e) = (1 − R²)var(Y)
by (4.22-24). The numerator of the F-statistic therefore equals

(||Xβ̂||² − ||Xβ̂^(s)||²)/(p − 1) = [n/(p − 1)] R²var(Y).

Because ē = 0, the denominator of F is

||e||²/(n − p) = [n/(n − p)] var(e) = [n/(n − p)] (1 − R²)var(Y).

So

F = [(n − p)/(p − 1)] · R²/(1 − R²).

Exercise Set F, Chapter 5 


1. The β̂’s are dependent.
2. e ⊥ Ŷ so ||Y||² = ||Ŷ||² + ||e||². From the definition, 1 − R² =
(||Y||² − ||Ŷ||²)/||Y||² = ||e||²/||Y||².


Discussion questions, Chapter 5 
1. Let ε_i = X_i − E(X_i). Use OLS to fit the model


(a)-f 2)(2)+(2). 


There is no intercept. 
2. The F-test assumes the truth of the model, and tests whether some group
of coefficients are all 0. The F-test can be used to see if a smaller model
is OK, given that a larger model is OK. (But then how do you test the
larger model? Not with the F-test?!?)


3. Both assertions are generally false. As (4.9) shows, E(β̂|X) − β =
(X'X)⁻¹X'γ. This won’t be 0, unless γ ⊥ X. According to (4.10),
cov(β̂|X) = (X'X)⁻¹X'GX(X'X)⁻¹. There is no σ² in this formula:
the ε_i have different variances, which appear on the main diagonal of G.
If you want (a) and (b) to be true, you need γ ⊥ X and G = σ²I_{n×n}.


4. β̂_1 is biased, β̂_2 unbiased. This follows from exercise 3B17, but here
is a better argument. Let c ≠ 0. Suppose that γ_i = c for all i. The
bias in β̂ is c(X'X)⁻¹X'1_{n×1} by (4.9). Let u be the p × 1 vector which
is all 0’s, except that u_1 = 1. Then c(X'X)⁻¹X'1_{n×1} = cu, because
X'1_{n×1} = X'Xu. This in turn follows from the fact that Xu = 1_{n×1}:
the first column of X is all 1’s.


Comment. There is some opinion that E(ε_i) ≠ 0 is harmless, only biasing β̂_1.
True enough, if E(ε_i) is the same for all i. Otherwise, there are problems.
That is the message of questions 3-4.


(a), (b), (c) are true: the central limit theorem helps with (c), because 
you have 94 degrees of freedom. (d) is false: with 4 degrees of freedom, 
you need normal errors to use t. 

$\mathrm{cov}(X_i, Y_i) = 0$. If Julia regresses $Y_i$ on $X_i$, the slope will be 0, up
to sampling error. She will conclude there is no relationship. This
is because she fitted a straight line to curved data. Of course, if she
regressed $Y_i$ on $X_i$ and $X_i^2$, she'd be a heroine.

The $X_i$ all have the same distribution: normal with mean $\mu$ and
variance 2. The $X_i$ are not independent: they have $U$ in common. Their
mean $\bar X = \mu + U + \bar V$ is $N(\mu, 1 + 1/n)$. Thus, $|\bar X - \mu|$ is around 1.
Next, $X_i - \bar X = V_i - \bar V$, so $s^2$ is the sample variance of $V_1, \ldots, V_n$, and
$s^2 \sim \chi^2_{n-1}/(n - 1) \approx 1$. So, (e) is false: sampling error in $\bar X$ is much
larger than $s/\sqrt{n}$.

Issues. (i) $U$ is random, although it does not vary across subjects.
(ii) $\bar V$ is the average, not the expectation. Indeed, $\bar V$ will be on the order
of $\pm 1/\sqrt{n}$, while $E(\bar V)$ is exactly 0.


Moral. Without independence, $s/\sqrt{n}$ isn't good for much.
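The moral can be seen in a short simulation. This is a sketch under assumed values ($\mu = 10$, $n = 100$, 2000 replications, none of which come from the exercise): each sample shares one draw of $U$, and the typical error in the sample mean is compared with the naive $s/\sqrt{n}$.

```python
import math
import random

random.seed(0)
mu, n, reps = 10.0, 100, 2000   # illustrative values, not from the text

sq_err = 0.0   # accumulates (Xbar - mu)^2
se_sum = 0.0   # accumulates the naive s / sqrt(n)
for _ in range(reps):
    u = random.gauss(0.0, 1.0)   # shared by all subjects in this sample
    xs = [mu + u + random.gauss(0.0, 1.0) for _ in range(n)]
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)
    sq_err += (xbar - mu) ** 2
    se_sum += math.sqrt(s2 / n)

rms_error = math.sqrt(sq_err / reps)   # typical size of Xbar - mu
mean_naive_se = se_sum / reps          # typical value of s / sqrt(n)
```

With these settings `rms_error` should come out near 1 and `mean_naive_se` near 0.1, so the naive SE understates the sampling error by roughly a factor of 10.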


(a) is true and (b) is false, as shown in 7. You need to assume indepen- 
dence to make the usual statistical calculations. 
9.


10. 


11. 


12. 


(a) is true. With reasonably large samples, normality doesn’t matter 
so much. The central limit theorem takes care of things. However, as 
shown in the previous exercises, claim (b) is false—even for normal 
random variables. You need to assume independence to make the usual 
statistical calculations.


The social scientist is a little mixed up. The whole point of GLS is to 
downweight observations with high variance—and on the whole, those 
are the observations that are far from their expected values. Feasible 
GLS tries to imitate real GLS: that means downweighting discrepant 
observations. If $\hat G$ in (13) is a good estimate for $G$ in (7), then FGLS
works like a champ. If not, not. 


Putting Z into the equation likely reduces the sampling error in the 
estimates, and guards against omitted-variables bias. On the other hand, 
if you do put in Z, endogeneity bias is a possibility. 

(: Omitted-variables bias + Endogeneity bias = Scylla + Charybdis :)

Something is wrong. The SE for the sample mean is $\sqrt{110/25} = 2.10$,
so $t = 5.8/2.1 = 2.76$ and $P = 0.01$.


Chapter 6 Path Models 


Exercise Set A 


1. 
2.


Answer omitted. 


a is a parameter; the numbers are all estimates. The 0.753 is an estimate
for the standard deviation of the error term $\eta$ in equation (3).


True. Variables are standardized, so the residuals automatically have 
mean 0. The variance is the mean square. With these diagrams, it is 
conventional to divide by n not n — p: this is fine if n is large and p is 
small, which is the case here. 


In matrix notation, after fitting, we get $Y = X\hat\beta + e$ where $e \perp X$. So
$\|Y\|^2 = \|X\hat\beta\|^2 + \|e\|^2$. In particular, $\|e\|^2 \le \|Y\|^2$ and $\|e\|^2/n \le
\|Y\|^2/n$. Since $Y$ is standardized, $\|Y\|^2/n = 1$.


Comments. (i) Here, it is irrelevant that X is standardized. (ii) If we divide 
by n — p rather than n, then var(e) may exceed 1. 


5.
6. 


The SD. Variance is the wrong size (section 2.4). 


These arrows were eliminated by assumption. (On the other hand, if 
you put them in and compute the path coefficients from table 1, they’re 
pretty small.) There could in some sense be an arrow from Y to U,
because people train themselves in order to get certain kinds of jobs. (A 
dedicated path analyst might respond by putting in “plans” as a latent 
variable driving education and occupation.) 

Intelligence and motivation are mentioned in the text—as are mothers. 
Other possibilities include race, religion, area of residence, .... 

When $Y_i$ is lumpy, $Y_i = X_i\beta + \epsilon_i$ isn't good, because $X_i\beta + \epsilon_i$ can
usually take a lot of different values ($\beta$ varies and $\epsilon_i$ is random additive
noise) whereas $Y_i$ takes only a few values.


Exercise Set B, Chapter 6 


2.


v is for data, $\sigma^2$ is for random variables.


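Write the model as $y_i = a + bx_i + \epsilon_i$; the $\epsilon_i$ are IID with mean 0 and
variance $\sigma^2$, for $i = 1, \ldots, n$. The fitted equation is

$$y_i = \hat a + \hat b x_i + e_i. \qquad (*)$$

Now

$$v = \frac{1}{n}\sum_{i=1}^n (x_i - \bar x)^2 \quad\text{and}\quad s^2 = \frac{1}{n}\sum_{i=1}^n e_i^2.$$

Next, $\bar y = \hat a + \hat b\bar x$ because $\bar e = 0$. Then $y_i - \bar y = \hat b(x_i - \bar x) + e_i$. The
sample covariance between $x$ and $y$ is

$$\frac{1}{n}\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y) = \frac{1}{n}\sum_{i=1}^n (x_i - \bar x)\,[\hat b(x_i - \bar x) + e_i] = \hat b v$$

because $\bar e = 0$ and $e \perp x$. Similarly, the sample variance of $y$ is
$\mathrm{var}(y) = \hat b^2 v + s^2$. The standardized slope is the correlation between
$x$ and $y$, namely,

$$\frac{\mathrm{cov}(x, y)}{\sqrt{\mathrm{var}(x)\,\mathrm{var}(y)}} = \frac{\hat b v}{\sqrt{v(\hat b^2 v + s^2)}} = \frac{\hat b\sqrt{v}}{\sqrt{\hat b^2 v + s^2}}.$$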
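The algebra can be confirmed on a small data set. In this sketch the numbers are invented for illustration; variances and covariances use division by $n$, matching the definitions of $v$ and $s^2$ in the answer.

```python
import math

x = [1.0, 2.0, 4.0, 5.0, 7.0, 9.0]   # hypothetical weights
y = [2.1, 2.9, 5.2, 5.8, 8.1, 9.7]   # hypothetical lengths
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
v = sum((a - xbar) ** 2 for a in x) / n                        # sample variance of x
cov = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / n   # sample covariance
b_hat = cov / v
a_hat = ybar - b_hat * xbar
s2 = sum((b - a_hat - b_hat * a) ** 2 for a, b in zip(x, y)) / n   # mean square of residuals
var_y = sum((b - ybar) ** 2 for b in y) / n

r = cov / math.sqrt(v * var_y)   # correlation between x and y
std_slope = b_hat * math.sqrt(v) / math.sqrt(b_hat ** 2 * v + s2)
```

Up to rounding error, `r` equals `std_slope`, and `var_y` equals `b_hat ** 2 * v + s2`.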


Suppose b is positive (as it would be for a spring). If $\sigma^2$ is small, the
right side of (9) will be nearly 1, which tells us that the data fall along 
a straight line. This is fine as far as it goes, but is not informative about 
the stretchiness of the spring. 
Exercise Set C, Chapter 6


1. The $-0.35$ is an estimate. It estimates the parameter $\beta_2$ in (10).
No answers supplied for exercises 2-4. 
5. Lumpiness makes linearity a harder sell; see exercise A8. 
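6. $Y_i = \hat a + \hat b U_i + \hat c V_i + e_i$, so $\bar Y = \hat a + \hat b\bar U + \hat c\bar V$ and $Y_i - \bar Y =
\hat b(U_i - \bar U) + \hat c(V_i - \bar V) + e_i$. Then

$$\frac{Y_i - \bar Y}{s_Y} = \hat b\,\frac{s_U}{s_Y}\,\frac{U_i - \bar U}{s_U} + \hat c\,\frac{s_V}{s_Y}\,\frac{V_i - \bar V}{s_V} + \frac{e_i}{s_Y}.$$

The standardized coefficients are $\hat b s_U/s_Y$ and $\hat c s_V/s_Y$.


Comment. If you normalize $\hat\sigma^2$ by $n - p$, the t-statistics for $\hat b$ are the same
whether you standardize or not. Ditto for $\hat c$. If you want standardized
coefficients to estimate parameters, the setup is explained in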


http://www.stat.berkeley.edu/users/census/standard.pdf 


Exercise Set D, Chapter 6 
1. (a) 450 +30 = 480. (b) 450+ 60 = 510. 
(c) 450+ 30 = 480. (d) 450 + 60 = 510. 


Comment. You get the same answer for both subjects. That assumption is 
built into the response schedule. 


2. All she needs is observational data on hours of coaching and Math SAT 
scores for a sample of coachees—if she’s willing to assume the response 
schedule and exogeneity of coaching hours. Exogeneity is the additional 
assumption. She would estimate the parameters by running a regression 
of Math SAT scores on coaching hours. 


Comments. (i) Response schedules and exogeneity are very strong assump- 
tions. People do experiments because these assumptions seem unrealistic. 


(ii) The constant intercept is particularly unattractive here. Some researchers
might try a fixed-effects model, $Y_{i,x} = a_i + bx + \delta_i$. The intercept $a_i$ varies
from one coachee to another, and takes individual ability into account. Some
investigators might assume that there was no relationship between $a_i$ and the
amount of coaching taken by $i$, although this assumption, like the constancy
of $b$, is not completely plausible. The "no-relationship" assumption can be
implemented in a random-effects model, where $a_i$ is chosen at random from
a population of possible intercepts. This is equivalent to the model we began
with, with $a$ the average of the possible intercepts; $a_i - a$ goes into $\delta_i$. In many
contexts, random-effects models are fanciful (Berk and Freedman 2003).
Exercise Set E, Chapter 6


1. There would be two free arrows, one pointing into X and one into Y,
representing the error terms in the equations for X and Y, respectively.
The curved line represents association. There are two equations: $X_i =
a + bU_i + cV_i + \delta_i$ and $Y_i = d + eU_i + fX_i + \epsilon_i$. We assume that the $\delta$'s are
IID with mean 0 and variance $\sigma^2$; the $\epsilon$'s are IID with mean 0 and variance
$\tau^2$; the $\delta$'s are independent of the $\epsilon$'s. The parameters are $a, b, c, d, e, f$,
also $\sigma^2$ and $\tau^2$. You need $U_i, V_i, X_i, Y_i$ for many subjects $i$, with the
U's and V's independent of the $\delta$'s and $\epsilon$'s (exogeneity). You regress
X on U, V, with an intercept; then Y on U, X, again with an intercept.
There is no reason to standardize.


For causal inference, you would need to assume response schedules: 


$$X_{i,u,v} = a + bu + cv + \delta_i, \qquad (*)$$

$$Y_{i,u,v,x} = d + eu + fx + \epsilon_i. \qquad (**)$$


There is no $v$ on the right hand side of (**) because there is no arrow
leading directly from V to Y. You would need the usual assumptions on 
the error terms, and exogeneity. 


You could conclude qualitatively that X affects (or doesn’t affect) Y, 
depending on the significance of f. You could conclude quantitatively 
that if X is increased by one unit, other things being held equal (namely, 
U and V), then Y would go up f units. 


2. Answer omitted. 


3. You just regress Z on X and Y. Do not standardize: for instance, you 
want to estimate e. The coefficients have a causal interpretation, in view 
of the response schedule. See section 6.4. 


4. (a) False: no arrow from V to Y. (b) True. (c) True. (d) False. 


5. (a) Use (17). The answer is b. 
(b) Use (18). The answer is (13 — 12)d + (5 — 2)e = d+ 3e. 


Comments. (i) By assumption, intervening doesn’t change the parameters. 
(ii) The effects would be estimated from the data as $\hat b$ and $\hat d + 3\hat e$, respectively.


6. Disagree. The test tries to tell you whether an effect is zero or non-zero. 
It does not try to tell you about the size of the effect. See exercise 5D4. 
Discussion questions, Chapter 6


1.


You don’t expect much change in Mrs. Wang. What the 0.57 means is 
this. If you draw the graph of averages for the data (figure 2.2), the dots 
slope up, more or less following a line—the regression line. That line 
has slope 0.57. So, let’s fix some number of years of education, call it 
x, and compare two groups of women: 


(i) all the women whose husband’s educational level was x years, and 
(ii) all the women whose husband’s educational level was x + 1 years. 


The second group should have higher educational level—higher by 
around 0.57 years, on average (Freedman-Pisani-Purves, 2007, §10.2). 
(a) True. (b) True. (c) True. (d) True. (e) False.

The computer is a can-do machine. It runs the regressions whether 
assumptions are true or false. (Although even the computer has trouble 
if the design matrix is rank-deficient.) The trouble is this. If errors 
are dependent, the SEs that the computer spits out can be quite biased 
(section 4.4). If the errors don't have mean 0, bias in $\hat\beta$ is another big
issue. 


(a) country, 72. 

(b) IID, mean 0, variance $\sigma^2$, independent of the explanatory variables.

(c) Can’t get â or the other coefficients without the data. You can 
estimate the standardized equation from the correlations. 

(d) Controlling for the other variables reversed the sign. 

(e) The t-statistics (and signs) will be the same in the standardized 
equation and the raw equation—you’re just changing the scale. See 
exercise 6C6. 

(f) Not clear why the assumptions make sense, or where a response 
schedule would come from. What intervention are we talking 
about?? Even if we set all such objections to one side, it is very 
odd to have CV on the right hand side of the equation. Presumably, 
as modelers would see things, CV is caused by PO; if so, it’s en- 
dogenous. If you regress PO on FI and EN only, then FI has a tiny 
beneficial effect. If you regress CV on PO, FI, and EN (or just on FI 
and EN), then FI has a strong beneficial effect. The data show that 
foreign investment is harmful only if you insist on a set of rather 
arbitrary assumptions. 

Take two people $i$ and $j$ in the same ethnic group, living in the same
town: $\delta_i = Y_i - X_i\beta$, $\delta_j = Y_j - X_j\beta$, and $\delta_i - \delta_j = (X_j - X_i)\beta$
because $Y_i = Y_j$. In this model, independence cannot be. The standard
errors and significance levels aren't reliable. The analysis is off the rails.
The diagram unpacks into five regression equations. The first equation
is GPQ = $a$ ABILITY + $\delta$. The second is PREPROD = $b$ ABILITY + $\epsilon$.
And so forth. The usual assumptions are made about the error terms.
The numbers on the arrows are estimated coefficients in the equations.
For instance, $\hat a = 0.62$, $\hat b = 0.25$, etc.


The good news—from the perspective of Rodgers and Maranto—must 
be the absence of an arrow that goes directly from GPQ to CITES, and 
the small size of the path coefficients from GPQ to QFJ and QFJ to 
PUBS or CITES. People will cite your papers even if you didn’t get 
your PhD from a “prestigious graduate program.” The bad news seems 
to be that GPQ has a positive indirect effect on CITES through QFJ. If 
two researchers are equal on SEX and ABILITY, the one with the PhD 
from Podunk University will have fewer CITES. 


The news is less than completely believable. First of all, this is a very 
peculiar sample. Who are the 86+ 76 = 162 people with data? Second, 
what do the measurements mean? (For instance, ABILITY is all based 
on circumstantial evidence—where the subject did the undergraduate 
degree, what others thought of the subject as an undergraduate, etc.) 
Third, why should we believe any of the statistical assumptions? Just 
to take one example, PREPROD is going to be a small whole number 
(0, 1,2,...), and mainly 0. How can this be the left hand side variable 
in a regression equation? Next, well, maybe that’s enough. 


The data are inconsistent: measurement error. Let $a$ be the exact weight
of A, $b$ the exact weight of B, etc. It will be easier to use offsets from
a kilogram, so $a = 53$ µg; $b$ is the difference between the exact weight
of B and 1 kg, etc. The parameters are $b, c, d$. The first line in the table
says $a + b - c - d + \delta_1 = 42$, where $\delta_1$ is measurement error. So
$b - c - d + \delta_1 = 42 - a = -11$. The second line in the table says
$a - b + c - d + \delta_2 = -12$, so $-b + c - d + \delta_2 = -12 - a = -65$.
And so forth. Weights on the left hand balance pan come in with a plus
sign; on the right, with a minus sign. We set up the regression model in
matrix form as follows:


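The response vector $(-11, -65, \ldots, -12, +36, +64)'$ equals the $6 \times 3$ design
matrix of $\pm 1$'s (first row $+1, -1, -1$; second row $-1, +1, -1$; and so forth)
times the parameter vector $(b, c, d)'$, plus the error vector $(\delta_1, \ldots, \delta_6)'$.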


In the last three rows, A is on the right, so you have to add 53 to the 
difference, not subtract. Assume the $\delta$'s are IID with mean 0 and variance
$\sigma^2$, which in this application seems pretty reasonable. OLS gives $\hat b =
33$, $\hat c = 26$, $\hat d = 44$. The SEs are all estimated as 17. (There is a lot of
symmetry in the design matrix.) Units are µg.


Comment. You can do the regression with a pocket calculator, but it’s easier 
on the computer. 


7. 
8. 


10. 
11. 


12. 


Answer omitted. 


The average response of the subjects assigned to treatment at level 0 is
an unbiased estimate of $\alpha_0$. This follows from question 7. The subjects
assigned to treatment at level 0 are a simple random sample of the
population; the average of a simple random sample is an unbiased estimate
of the population average. Likewise for $\alpha_{10}$ and $\alpha_{50}$. You can't get $\alpha_{75}$
without assuming a functional form for the response schedule, another
reason why people model things. On the other hand, if you get the
functional form wrong... .


Randomization doesn't justify the model. Why would the response be
linear? For example, suppose that in truth, $y_{i,0} = 0$, $y_{i,10} = 9$, $y_{i,50} =
3$, $y_{i,75} = 3$. There is some kind of threshold, then the effect saturates.
If you fit a straight line to the data, you will look pretty silly. If the linear
model is right, yes, you can extrapolate to 75.


Like 9. See Freedman (2006b, 2008a) for additional discussion. 

If $E(X_i\epsilon_i) = 0$, OLS will be asymptotically unbiased. If $E(\epsilon_i|X_i) = 0$,
OLS will be exactly unbiased. Neither of these conditions is given.
For instance, suppose $p = 1$, the $Z_i$ are IID $N(0, 1)$. Let $X_i = Z_i$,
$\epsilon_i = Z_i^3$, and $Y_i = X_i\beta + \epsilon_i = \beta Z_i + Z_i^3$, where $\beta$ is scalar. By
exercise 3B15 and the law of large numbers, the OLS estimator is

$$\hat\beta = \frac{\sum_i X_iY_i}{\sum_i X_i^2} = \beta + \frac{\sum_i Z_i^4}{\sum_i Z_i^2} \to \beta + \frac{E(Z_i^4)}{E(Z_i^2)}.$$

The bias is about 3 because $E(Z_i^4) = 3$, $E(Z_i^2) = 1$. See end notes to
chapter 5.
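The example is easy to simulate. In the sketch below, the value $\beta = 2$, the sample size, and the seed are arbitrary choices made for illustration; the construction $X_i = Z_i$, $\epsilon_i = Z_i^3$ follows the answer.

```python
import random

random.seed(1)
beta, n = 2.0, 200_000   # illustrative values

sum_xy = 0.0
sum_xx = 0.0
for _ in range(n):
    z = random.gauss(0.0, 1.0)
    x = z
    eps = z ** 3          # error depends on X, so E(X * eps) = E(Z^4) = 3
    y = beta * x + eps
    sum_xy += x * y
    sum_xx += x * x

beta_hat = sum_xy / sum_xx   # OLS with one column and no intercept
bias = beta_hat - beta
```

With a sample this large, `bias` should land close to 3, and it does not shrink as `n` grows.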


Experiments are the best, because they minimize confounding. How- 
ever, they are expensive, and they may be unethical or impossible to do. 
Natural experiments are second-best. They’re hard to find, and data col- 
lection is expensive. Modeling is relatively easy: you control (or look 
like you’re controlling) for many confounders, and sometimes you get 
data on a good cross section of the population you’re interested in. This 
point is worth thinking about, because in practice, investigators often
have very funny samples to work with. On the other hand, models need 
a lot of assumptions that are hard to understand, never mind verify. See 
text for examples and discussion. 


13. False. The computer only cares whether the design matrix has full rank. 
14. ê must be 0, because Y = XB by definition. 


15. The OLS assumptions are wrong, so the formulas for SEs aren’t trust- 
worthy. 


Discussion. The coefficient of $X_i^2$ in the definition of $\epsilon_i$, which is built from
$X_i^4 - 3X_i^2$, makes $E(\epsilon_i) = 0$. Odd moments of $X_i$ vanish by symmetry, so
$E(X_i\epsilon_i) = 0$. The upshot is this. The $\epsilon_i$ are IID, and $E(X_i) = E(\epsilon_i) =
E(X_i\epsilon_i) = 0$. So
$E\{[Y_i - a - bX_i]^2\} = E\{[-a + (1 - b)X_i + \epsilon_i]^2\} = a^2 + (b - 1)^2 + \mathrm{var}(\epsilon_i)$
is minimized when $a = 0$ and $b = 1$. In other words, the true regression
line has intercept 0 and slope 1. The sample regression line is an estimate
of the true regression line. But $\epsilon_i$ is totally dependent on $X_i$. So the OLS
assumptions break down. When applied to the slope, the usual formula for
the SE is off by a factor of 3 or 4. (This is easiest to see by simulation, but
an analytic argument is possible.)


The scale factor 0.025 was chosen to get the high $R^2$, which can be computed
using the normal moments (end notes to chapter 5). Asymptotically, the
sample $R^2$ is

$$\left[\frac{\mathrm{cov}(X_i, Y_i)}{\mathrm{SD}(X_i)\,\mathrm{SD}(Y_i)}\right]^2. \qquad (*)$$

This follows from the law of large numbers: the sample mean of the $X_i$
converges to $E(X_i)$, likewise for the sample mean of $Y_i$, and the second-order
sample moments. The quantity (*) equals

$$\frac{\mathrm{cov}(X_i, Y_i)^2}{\mathrm{var}(X_i)\,\mathrm{var}(Y_i)} = \frac{1}{1 + 0.025^2\,E[(X_i^4 - 3X_i^2)^2]} = 0.9744.$$
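The value 0.9744 can be reproduced from the normal moments. The sketch assumes, as the derivation above suggests, that $X_i$ is $N(0, 1)$ and $Y_i = X_i + 0.025(X_i^4 - 3X_i^2)$; the function name `normal_moment` is introduced here for illustration.

```python
def normal_moment(k):
    """E(X^k) for X ~ N(0, 1): zero for odd k, (k - 1)!! for even k."""
    if k % 2 == 1:
        return 0.0
    result = 1.0
    for j in range(k - 1, 0, -2):
        result *= j
    return result

scale = 0.025
# E[(X^4 - 3X^2)^2] = E(X^8) - 6 E(X^6) + 9 E(X^4)
m = normal_moment(8) - 6 * normal_moment(6) + 9 * normal_moment(4)
# Under the assumed model, cov(X, Y) = 1, var(X) = 1, var(Y) = 1 + scale^2 * m.
r2_limit = 1.0 / (1.0 + scale ** 2 * m)
```

Here `m` works out to 105 - 90 + 27 = 42, and `r2_limit` rounds to 0.9744.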


Conclusion: R? measures goodness of fit, not validity of model assumptions. 
For other examples, see 


http://www.stat.berkeley.edu/users/census/badols.pdf 
16. The relationship is causal, but your estimates will be biased unless p = 0.


17. Choose (i) and (iv), dismiss the others. The null and alternative hypothe- 
ses constrain parameters in the model. See answer to exercise 5D3. 


18. 24.6, 29.4 = 5.4. Take the square root to get the SD. 
19.


20. 


The quote confuses bias with chance error. On average, across the 
various splits into treatment and control, the two groups are exactly 
balanced: no bias. With a randomized controlled experiment, there is 
no confounding. On the other hand, for any particular split, there is likely 
to be some imbalance. That will be part of the chance error in estimated 
treatment effects. Moreover, looking at a lot of baseline variables almost 
guarantees that some differences will be “significant” (section 5.8). 


No. Use the Gauss-Markov theorem (section 5.2). 


Chapter 7 Maximum Likelihood 


Exercise Set A 


1. 


No coincidence. When the random variables are independent, the like- 
lihood function is a product, so the log likelihood function is a sum. 


No answer supplied for (a). For (b), $P(U < y) = \int_{-\infty}^{y} \phi(z)\,dz$ and
$P(-U < y) = P(U > -y) = \int_{-y}^{\infty} \phi(z)\,dz$, where $\phi$ is the standard
normal density. The integrals are areas under $\phi$, which is symmetric;
the areas are therefore equal. More formally, change variables in the
second integral, putting $w = -z$.

The MLE is S/n, where S is Binomial(n, p). This is only asymptotically 


normal. The mean is p and the variance is p(1 — p)/n. 


The MLE is $S/n$, where $S$ is Poisson($n\lambda$). This is only asymptotically
normal. The mean is $\lambda$ and the variance is $\lambda/n$. Watch it: $S/n$ isn't
Poisson.


Comment. The normal, Poisson, and binomial examples are exponential 
families in the “mean parameterization.” In such cases, the MLE is unbiased 
and option (i) in the theorem gives the exact variance. Generally, the MLE is 
biased and the theorem only gives approximate variances. 


5. 


$P\{\theta U/(1 - U) > x\} = P\{U > x/(\theta + x)\} = 1 - [x/(\theta + x)] =
\theta/(\theta + x)$, so the density is $\theta/(\theta + x)^2$. This is one way to construct the
random variables in example 4, section 7.1.
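The construction can be checked without any simulation. In this sketch, $\theta = 2$ and the function names are illustrative assumptions; the code maps a grid of $u$ values through $x = \theta u/(1 - u)$ and compares $P(U > u) = 1 - u$ with the survival function $\theta/(\theta + x)$.

```python
theta = 2.0   # illustrative parameter value

def from_uniform(u, theta):
    """Map u in (0, 1) to x = theta * u / (1 - u)."""
    return theta * u / (1.0 - u)

def survival(x, theta):
    """P(X > x) = theta / (theta + x) for x >= 0."""
    return theta / (theta + x)

grid = [k / 100 for k in range(1, 100)]
# P(U > u) = 1 - u should match the survival function at the mapped point.
max_gap = max(abs((1.0 - u) - survival(from_uniform(u, theta), theta)) for u in grid)
median = from_uniform(0.5, theta)   # the point with half the probability above it
```

The gap is zero up to rounding, and `median` equals `theta`.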


The likelihood is $\theta^n / \prod_{i=1}^n (\theta + X_i)^2$. So

$$L_n(\theta) = n\log\theta - 2\sum_{i=1}^n \log(\theta + X_i).$$

Then

$$L_n'(\theta) = \frac{n}{\theta} - 2\sum_{i=1}^n \frac{1}{\theta + X_i},$$

$$\theta L_n'(\theta) = n - 2\sum_{i=1}^n \frac{\theta}{\theta + X_i} = -n + 2\sum_{i=1}^n \frac{X_i}{\theta + X_i},$$

as required. But $X_i/(\theta + X_i)$ is a decreasing function of $\theta$. Finally,
$\theta L_n'(\theta)$ tends to $n$ as $\theta$ tends to 0, while $\theta L_n'(\theta)$ tends to $-n$ as $\theta$ tends
to $\infty$. Hence $\theta L_n'(\theta) = 0$ has exactly one root.
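Because $\theta L_n'(\theta)$ is decreasing with a single sign change, bisection finds the MLE reliably. The following is a minimal sketch: the true value $\theta = 2$, the sample size, the seed, and the function names are all assumptions made for illustration, and the data are generated by the construction in exercise 5.

```python
import random

def score_times_theta(theta, xs):
    """theta * L_n'(theta) = -n + 2 * sum of x / (theta + x)."""
    return -len(xs) + 2.0 * sum(x / (theta + x) for x in xs)

def mle(xs, lo=1e-9, hi=1e9, iters=200):
    """Bisection for the unique root; the function is decreasing in theta."""
    for _ in range(iters):
        mid = (lo * hi) ** 0.5   # geometric midpoint: theta ranges over many scales
        if score_times_theta(mid, xs) > 0:
            lo = mid
        else:
            hi = mid
    return (lo * hi) ** 0.5

random.seed(2)
true_theta = 2.0   # illustrative value
us = [random.random() for _ in range(500)]
xs = [true_theta * u / (1.0 - u) for u in us]   # construction from exercise 5
theta_hat = mle(xs)
```

The estimate `theta_hat` should be in the neighborhood of 2, with `score_times_theta(theta_hat, xs)` essentially 0.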


7. The median is $\theta$.

8. The Fisher information is $\theta^{-2} - 2\theta\int_0^\infty (\theta + x)^{-4}\,dx$.

9. Let $S = X_1 + \cdots + X_n$. The MLE for $\lambda$ is $S/n$, so the MLE for
$\theta$ is $(S/n)^2$. This is biased: $E[(S/n)^2] = [E(S/n)]^2 + \mathrm{var}(S/n) =
\lambda^2 + (\lambda/n) = \theta + (\sqrt{\theta}/n)$.

10. The MLE is $\sqrt{S/n}$. Biased.
Comment. Generally, if $\hat\lambda$ is the MLE for a parameter $\lambda$, and $f$ is a smooth
1-1 function, $f(\hat\lambda)$ is the MLE for $f(\lambda)$. Even if $\hat\lambda$ is unbiased, however, you
should expect bias in $f(\hat\lambda)$ unless $f$ is linear. For math types, if $X$ is a positive
random variable with a finite mean, not a constant, then $E(\sqrt{X}) < \sqrt{E(X)}$.
Generally, if $f$ is strictly concave, $E(f(X)) < f(E(X))$: this is Jensen's
inequality.
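The bias in exercise 9 can be computed exactly by summing over the Poisson distribution of $S$. The values $\lambda = 2$ and $n = 5$ below are illustrative assumptions, and the truncation at 400 terms is a convenience that loses nothing visible at this rate.

```python
import math

def second_moment_of_mean(lam, n, terms=400):
    """E[(S/n)^2] for S ~ Poisson(n * lam), summing the pmf term by term."""
    rate = n * lam
    pmf = math.exp(-rate)   # P(S = 0)
    total = 0.0
    for k in range(terms):
        total += (k / n) ** 2 * pmf
        pmf *= rate / (k + 1)   # P(S = k + 1) from P(S = k)
    return total

lam, n = 2.0, 5   # illustrative values
theta = lam ** 2
expected = second_moment_of_mean(lam, n)
bias = expected - theta   # should equal lam / n = sqrt(theta) / n
```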


11. Use the MLE. The likelihood function is

$$\prod_{i=1}^{20} e^{-i\lambda}\,\frac{(i\lambda)^{X_i}}{X_i!}.$$

You write down the log likelihood function, differentiate, set the
derivative to 0, and solve: $\hat\lambda = \sum_{i=1}^{20} X_i \big/ \sum_{i=1}^{20} i = \sum_{i=1}^{20} X_i/210$.
Comment. In this exercise and the next one, the random variables are
independent but not identically distributed. Theorem 1 covers that case, as
noted in text, although options (i) and (ii) for asymptotic variance get more
complicated. For instance, (i) becomes $\{-E_{\theta_0}[L_n''(\theta_0)]\}^{-1}$.
12. The log likelihood function $L(\alpha, \beta)$ is

$$-\frac{1}{2}\left[(X - \alpha - \beta)^2 + (Y - \alpha - 2\beta)^2 + (Z - 2\alpha - \beta)^2\right] - \frac{3}{2}\log(2\pi).$$


Maximizing L is the same as minimizing the sum of squared residuals. 
(Also see discussion question 5.1.) 
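Minimizing the sum of squared residuals here is a two-parameter least-squares problem, which can be solved by hand from the normal equations. The sketch assumes the means are $\alpha + \beta$, $\alpha + 2\beta$, $2\alpha + \beta$, as read off the log likelihood; the function name and the test values are hypothetical.

```python
def fit_alpha_beta(x, y, z):
    """Least squares for means a + b, a + 2b, 2a + b (no intercept)."""
    rows = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)]   # design matrix, one row per observation
    obs = [x, y, z]
    # Normal equations (M'M) theta = M'obs for the 2 x 2 system.
    s11 = sum(r[0] * r[0] for r in rows)
    s12 = sum(r[0] * r[1] for r in rows)
    s22 = sum(r[1] * r[1] for r in rows)
    t1 = sum(r[0] * o for r, o in zip(rows, obs))
    t2 = sum(r[1] * o for r, o in zip(rows, obs))
    det = s11 * s22 - s12 * s12
    alpha_hat = (s22 * t1 - s12 * t2) / det
    beta_hat = (s11 * t2 - s12 * t1) / det
    return alpha_hat, beta_hat

# Noise-free check with alpha = 1, beta = 2: the means are 3, 5, 4.
alpha_hat, beta_hat = fit_alpha_beta(3.0, 5.0, 4.0)
```

With noise-free inputs the function recovers the parameters exactly; with real observations it returns the MLE under the normal model.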
Comment. In the OLS model with IID $N(0, \sigma^2)$ errors and a fixed design
matrix of full rank, the MLE for $\beta$ coincides with the OLS estimator and is
therefore unbiased (theorem 4.2). The MLE for $\sigma^2$ is the mean square of the
residuals, with division by $n$ not $n - p$, and is therefore biased (theorem 4.4).
13. c(@) = 0: that’s what makes Vio Po{X; = j} = 1. Use the MLE 
to estimate 6. (You should write down the log likelihood function and 
differentiate it.) 
14. $L_n(\theta) = -\sum_i |X_i - \theta| - n\log 2$. This is maximized (because the sum
is minimized) when $\theta$ is the median. See exercise 2B18.


Exercise Set B, Chapter 7 


1. All the statements are true, except for (c): the probability is 0. 


2. (a) $X_i$ is the $1 \times 4$ vector of covariates for subject $i$, namely, 1, ED$_i$,
INC$_i$, MAN$_i$. And $\beta$ is the $4 \times 1$ parameter vector in the probit
model: see text.

(b) random, latent. 
(c) The U; should be IID N(0, 1) and independent of the covariates. 
(d) sum, term, subject. 
3. False. The difference in probabilities is
$\Phi(0.29) - \Phi(0.19) = 0.61 - 0.57 = 0.04$.
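The difference can be reproduced with the standard normal distribution function, written here in terms of the error function from the Python standard library. The helper name `std_normal_cdf` is introduced for illustration.

```python
import math

def std_normal_cdf(x):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

p_high = std_normal_cdf(0.29)
p_low = std_normal_cdf(0.19)
difference = p_high - p_low   # about 0.04 on the probability scale
```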


Exercise Set C, Chapter 7 


1. $E(X) = \mu$ so $\mu$ is estimable: the estimator is $X$. Next, $\mathrm{var}(X) = \sigma^2$.


The distribution of $X$ determines $\sigma^2$, so $\sigma^2$ is identifiable. Watch it:
(i) var(X) is computed not from X but from the distribution of X.
(ii) We only have one X, not a sample of X's.
2. Both parameters are estimable: $E(X_1) = \alpha$, $E[(X_2 - X_1)/9] = \beta$.
(You would get smaller variances with OLS, but the exercise only asks 
for unbiased estimators.) 


3. If the rank is $p$, then $\beta$ is estimable (use OLS), hence identifiable. If
the rank is $p - 1$, there will be a $\gamma \ne 0_{p\times 1}$ with $X\gamma = 0_{n\times 1}$. If $\beta$ is
any multiple of $\gamma$, we get the same distribution for $Y = X\beta + \epsilon = \epsilon$, so
$\beta$ is not identifiable. That is why the rank condition is important.


Let $\delta_i$ be IID with mean 0 and variance $\sigma^2$. Let $\epsilon_i = \mu_i + \delta_i$. Then
$Y = (X\beta + \mu) + \delta$. So $X\beta + \mu$ is estimable. But the pieces $X\beta$, $\mu$ can't
be separated. So, $\beta$ isn't identifiable.


Let's take this more slowly. We choose values $\beta^{(1)}$ and $\mu^{(1)}$ for $\beta$ and
$\mu$. Then we choose another value $\beta^{(2)} \ne \beta^{(1)}$ for $\beta$. Let

$$\mu^{(2)} = X\beta^{(1)} + \mu^{(1)} - X\beta^{(2)},$$

so $X\beta^{(1)} + \mu^{(1)} = X\beta^{(2)} + \mu^{(2)}$. Call the common value $\lambda$.


Next, consider a poor statistician who knows the distribution of $Y$
but not the parameters we used to generate the distribution. He cannot
tell the difference between $(\beta^{(1)}, \mu^{(1)})$ and $(\beta^{(2)}, \mu^{(2)})$. With the first
choice, $Y$ is normal with $E(Y) = \lambda$ and $\mathrm{cov}(Y) = \sigma^2 I_{n\times n}$. With the
second choice, $Y$ is normal with $E(Y) = \lambda$ and $\mathrm{cov}(Y) = \sigma^2 I_{n\times n}$.
The distribution of $Y$ is the same for both choices. That is why $\beta$ isn't
identifiable. (Neither is $\mu$.)

For regression models, the condition that $E(\epsilon|X) = 0_{n\times 1}$ is important.
This condition makes $\mu = 0_{n\times 1}$, so $\beta$ is identifiable from the distribution
of $Y$. (Remember that $X$ is fixed, of full rank, and observable; the error
term $\epsilon$ is unobservable, as are the parameters $\beta$, $\mu$, $\sigma^2$.)


$p^3$ is identifiable. If $p^3 \ne q^3$, then $p \ne q$ and
$P_p(X_1 = 1) \ne P_q(X_1 = 1)$.


However, $p^3$ is not estimable. For the proof, let $g$ be a function on pairs
of 0's and 1's. Then $E_p\{g(X_1, X_2)\}$ is


$p^2 g(1, 1) + p(1 - p)g(1, 0) + (1 - p)p\,g(0, 1) + (1 - p)^2 g(0, 0)$.
This is a quadratic function of $p$, not a cubic.


The sum of two independent normal variables is normal, so $U + V$ is
$N(0, \sigma^2 + \tau^2)$. Therefore, $\sigma^2 + \tau^2$ is identifiable, even estimable:
try $(U + V)^2$ for the estimator, and remember that $E(U) = E(V) = 0$.
But $\sigma^2$ and $\tau^2$ aren't separately identifiable. If you want to add a little
something to $\sigma^2$, just subtract the same amount from $\tau^2$; that won't
change the distribution of $U + V$.


If $W$ is $N(\mu, 1)$ and $X = |W|$, then $E(X^2) = E(|W|^2) = E(W^2) =
\mu^2 + 1$.
8. This question is well beyond the scope of the book, but the argument is
sketched, for whatever interest it may have. 


$|\mu|$ is not estimable. Let $f$ be a Borel function on $(-\infty, \infty)$. Assume
by way of contradiction that $E[f(\mu + Z)] = |\mu|$ for all $\mu$, with $Z$ being
$N(0, 1)$: we can afford to set $\sigma^2 = 1$. So

$$E[f(\mu + Z)] = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} f(\mu + z)\exp(-z^2/2)\,dz = |\mu|. \qquad (*)$$

Let $g(x) = f(x)e^{-x^2/2}$. Set $x = \mu + z$ in (*) to see that

$$\int_{-\infty}^{\infty} e^{\mu x} g(x)\,dx = \sqrt{2\pi}\,e^{\mu^2/2}\,|\mu|. \qquad (**)$$

The idea is to show that the left side of (**) is a smooth function of $\mu$,
which is impossible at $\mu = 0$: look at the right side of the equation!
We plan to differentiate the left side of (**) with respect to $\mu$, using
difference quotients (the value at $\mu + h$ minus the value at $\mu$) with
$0 < h < 1$. Start with $0 < x < \infty$. Because $h \to e^{hx}$ is an increasing
convex function of $h$ for each $x$,


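$$0 < \frac{e^{(\mu+h)x} - e^{\mu x}}{h} = e^{\mu x}\,\frac{e^{hx} - 1}{h} < e^{\mu x}(e^{x} - 1) < e^{(\mu+1)x}. \qquad (\dagger)$$

Similarly, for each $x < 0$, the function $h \to -e^{hx}$ is increasing and
concave, so

$$0 < \frac{e^{\mu x} - e^{(\mu+h)x}}{h} = e^{\mu x}\,\frac{1 - e^{hx}}{h} < e^{\mu x}\,|x| < e^{(\mu-1)x}. \qquad (\ddagger)$$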

Equation (**) says that $x \to e^{\mu x}g(x) \in L^1$ for all $\mu$. So $x \to
e^{\mu x}g^+(x) \in L^1$ and $x \to e^{\mu x}g^-(x) \in L^1$ for all $\mu$, where $g^+$ is the
positive part of $g$ and $g^-$ is the negative part. Then $x \to e^{(\mu\pm 1)x}g^\pm(x) \in L^1$
for all choices of signs. Apply the dominated convergence theorem
separately to four cases: (i) $g^+$ on the positive half-line, (ii) $g^+$ on the
negative half-line, (iii) $g^-$ on the positive half-line, (iv) $g^-$ on the
negative half-line. Equations (†) and (‡) make this work. The conclusion
is, we can differentiate under the integral sign:

$$\frac{\partial}{\partial\mu}\int_{-\infty}^{\infty} e^{\mu x}g(x)\,dx = \int_{-\infty}^{\infty} e^{\mu x}\,x\,g(x)\,dx$$

where the integral on the right converges absolutely. If you look back
at (**), there is a contradiction: $|\mu|$ is not differentiable at 0. The
conclusion: $|\mu|$ is not estimable.


$\sigma^2$ is not estimable from a sample of size 1. Let $f$ be a Borel function on
$(-\infty, \infty)$. Assume by way of contradiction that $E[f(\mu + \sigma Z)] = \sigma^2$
for all $\mu, \sigma$. Let $g(x) = f(-x)$; then $E[g(\mu + \sigma Z)] = \sigma^2$ for all
$\mu, \sigma$ too. We may therefore assume without loss of generality that $f$ is
symmetric: if not, replace $f$ by $[f(x) + g(x)]/2$. Take the case $\mu = 0$.
The uniqueness theorem for the Laplace transform says that $f(x) = x^2$
a.e. But now we have a contradiction, because $E[(\mu + \sigma Z)^2] = \mu^2 + \sigma^2$,
not $\sigma^2$. For the uniqueness theorem, see p. 243 in Widder (1946).


Comment. Let $Z$ be $N(0, 1)$. An easier version of the first part of exercise 8
might ask if there is a function $f$ such that $E[f(\mu + \sigma Z)] = |\mu|$ for all real
$\mu$ and all $\sigma \ge 0$. There is no such $f$. (We proved a stronger assertion, using
only $\sigma = 1$.) To prove the weaker assertion, which will be easier, take
$\sigma = 0$, concluding that $f(x) = |x|$ for all $x$. Then take $\mu = 0$, $\sigma = 1$ to get
a contradiction. (This neat argument is due to Russ Lyons.)


Exercise Set D, Chapter 7 


Most answers are omitted. For exercise 1, the last step is $P\{F(X) < y\} =
P\{X < F^{-1}(y)\} = F(F^{-1}(y)) = y$. For exercise 7, the distribution is
logistic. For exercise 11, one answer is sketched in the hints. Here is a more
elegant solution, due to Russ Lyons. With $\phi$ as in exercise 9,

$$L_n(\beta) = \sum_i \phi(X_i\beta) + \Big(\sum_i X_iY_i\Big)\beta,$$

the last term being linear in $\beta$. If $x, x^*$ are real numbers, then

$$\phi\Big(\frac{x + x^*}{2}\Big) \ge \frac{\phi(x) + \phi(x^*)}{2} \qquad (\dagger)$$

by exercise 9, the inequality being strict if $x \ne x^*$. Let $\beta \ne \beta^*$. By (†),

$$\sum_i \phi\Big(X_i\,\frac{\beta + \beta^*}{2}\Big) \ge \frac{\sum_i \phi(X_i\beta) + \sum_i \phi(X_i\beta^*)}{2}. \qquad (\ddagger)$$

If $X$ has full rank, there is an $i$ with $X_i\beta \ne X_i\beta^*$, and the inequality in (‡)
must be strict. Reminder: $f$ is concave if $f[(x + x^*)/2] \ge [f(x) + f(x^*)]/2$,
and strictly concave if the inequality is strict when $x \ne x^*$. If $f$ is smooth,
then $f$ is strictly concave when $f''(x) < 0$.
A useful formula: under the conditions of exercise 10, the likelihood
equation for the MLE is

$$\sum_{i=1}^n X_i\,[Y_i - p_i(\beta)] = 0, \quad\text{where}\quad p_i(\beta) = \frac{\exp(X_i\beta)}{1 + \exp(X_i\beta)}.$$
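The likelihood equation can be solved by Newton's method. The sketch below treats the simplest case, one covariate and no intercept, which is an assumption made to keep the code short; the data and function names are hypothetical.

```python
import math

def p(x, beta):
    """Logit probability exp(x * beta) / (1 + exp(x * beta))."""
    return 1.0 / (1.0 + math.exp(-x * beta))

def score(beta, xs, ys):
    """Left side of the likelihood equation: sum of x_i * (y_i - p_i(beta))."""
    return sum(x * (y - p(x, beta)) for x, y in zip(xs, ys))

def logit_mle(xs, ys, beta=0.0, steps=50):
    """Newton's method for a single coefficient (no intercept)."""
    for _ in range(steps):
        info = sum(x * x * p(x, beta) * (1.0 - p(x, beta)) for x in xs)
        beta += score(beta, xs, ys) / info
    return beta

xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0, 3.0]   # hypothetical covariate values
ys = [0, 0, 1, 0, 1, 1, 1]                    # hypothetical 0/1 responses
beta_hat = logit_mle(xs, ys)
```

At `beta_hat` the score is essentially 0, which is the likelihood equation. The log likelihood is concave, so this root is the maximum.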


Exercise Set E, Chapter 7 


1.


0.777 is an estimate for the parameter $\alpha$. This number is on the probit
scale. Next, 0.041 is an estimate for another parameter in equation (1),
namely, the coefficient of the dummy variable FEMALE (one of the
covariates in $X_i$).


This number is on the probit scale. Other things being equal, students 

whose parents have some college education are less likely to graduate 

than students whose parents have a college degree. (Look at table 1 in 

Evans and Schwab to spot the omitted category.) How much less likely? 

The estimate is 0.204 on the probit scale.

(a) @. 

(b) random, latent. 

(c) The $U_i, V_i$ are IID as pairs across subjects $i$. They are bivariate
normal. Each has mean 0 and variance 1. They are independent of
all the covariates in both equations, namely, IsCat and X. But $U_i, V_i$
are correlated within subject $i$. Let's call the correlation coefficient
$\rho$, for future reference.


0.859 estimates the parameter $\alpha$ in the two-equation model. This is
supposed to tell you the effect of Catholic schools. The $-0.053$
estimates the parameter $\rho$: see 3(c) above. This correlation is small and
insignificant, so (if the model is right) selection effects are trivial.


Comment. The 0.777 in exercise 1 and the 0.859 in exercise 4 both seem to be
estimating the same parameter $\alpha$. Why are they different? Well, exercise 1 is
about the one-equation model and exercise 4 is about the two-equation model.
The models are different. The two estimates are similar because $\hat\rho$ is close
to 0.


5. 


sum, term, student. 


6. The factor is

$$P\{U_{77} < -X_{77}b \text{ and } V_{77} > -X_{77}\beta\} = \int_{-X_{77}\beta}^{\infty}\int_{-\infty}^{-X_{77}b} \phi(u, v)\,du\,dv.$$

There's no $a$ because this student isn't Catholic. There's no $\alpha$ because
this student didn't attend a Catholic high school.


7. The factor is

$$P\{U_{4039} < -a - X_{4039}b \text{ and } V_{4039} < -X_{4039}\beta\} = \int_{-\infty}^{-X_{4039}\beta}\int_{-\infty}^{-a - X_{4039}b} \phi(u, v)\,du\,dv.$$


Notation. Integrals are read from the inside out. Take $\int_0^2\int_0^1 \phi(u, v)\,du\,dv$.
First, you integrate with respect to $u$, over the range [0,1]. Then you integrate
with respect to $v$, over [0,2]. You might have to squint, to distinguish $a$ from
$\alpha$ and $b$ from $\beta$.


8. $\rho$ is in $\phi$: see equation (15). 
Presumably, the two numbers got interchanged—a typo. 


10. This is the standard deviation of the data, not the standard error: 
$\sqrt{0.97 \times 0.03} \approx 0.17$. 


The standard deviation is a useful summary statistic for quantitative data, 
not for 0’s and 1’s. 
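
The arithmetic in answer 10 can be checked with a short sketch. It assumes only that a fraction 0.97 of the entries take one value and 0.03 the other, as the displayed calculation suggests; the function name and the list of 100 entries are ours, for illustration.

```python
import math
import statistics

def sd_of_zero_one_data(p):
    """SD of a list of 0's and 1's in which a fraction p are 1's."""
    return math.sqrt(p * (1 - p))

# Formula from the answer: square root of 0.97 x 0.03.
sd_formula = sd_of_zero_one_data(0.97)

# Same thing computed directly from an illustrative list of 97 ones and 3 zeros.
data = [1] * 97 + [0] * 3
sd_data = statistics.pstdev(data)  # pstdev divides by n, not n - 1

print(round(sd_formula, 2), round(sd_data, 2))
```

Both numbers describe the spread of the 0-1 data, not the standard error of an estimate.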


11. Unless the design matrix is a little weird, the MLE will be close to truth, 
so you’d nail $\alpha$, $\beta$. But even if you know $\alpha$, $\beta$, you don’t know the latent 
variables. For example, suppose subject $i$ went to Catholic school and 
graduated. According to the model, $V_i > -C_i\alpha - X_i\beta$. That’s quite a 
range of possible values for $V_i$. In this respect, the probit model is less 
satisfying than the regression model. 


Discussion Questions, Chapter 7 


1. The MLE is generally biased, but not always. In the normal, binomial, 
and Poisson examples of section 1, the MLE is unbiased. But see exer- 
cises 7A9-10 and lab 11 below. 


Comment. When the sample size is large, the bias is small, and so is the 
random error. (There are regularity conditions. . . .) 


2. The response variables are independent conditionally on the covariates. 
The covariates are allowed to be dependent across subjects. Covariates 
have to be linearly independent, i.e., perfect collinearity is forbidden: if 
one covariate was a linear combination of others, parameters would not 
be identifiable. Covariates do not have to be statistically independent, 
nor do covariate vectors have to be orthogonal. From the modelers’ 
perspective, that is a great advantage: you can disentangle effects even 
when the causes are mixed up together in various ways. 


3. (a)-(b)-(c) are true, but (d) is false. If the model is wrong, the parameter 
estimates may be meaningless. (What are we estimating?) Even if 
meaningful, the estimates are liable to be biased; so are the standard 
errors printed out by the computer. 


4. (a) False. (b) True. (c) False. (d) True. (e) False. (f) True. 


Comment. With respect to parts (a) and (b), the model does allow the effect 
of Catholic schools to be 0. The data are used to reject this hypothesis. If the 
model is right, the data show the effect to be large and positive. 


5. Independence is violated. So is a more basic assumption—that a sub- 
ject’s response depends only on that subject’s covariates and assignment. 


6. (a) 


(b) 


(c) 


(d) 


c. You could estimate the equations by maximum likelihood. (Here, 
coaching is binary—you either get it, or not; the response Y is 
continuous.) 

The response schedule is $Y_{i,x} = cx + V_i\beta + \sigma\epsilon_i$, where $x = 1$ 
means coaching, and $x = 0$ means no coaching. Nature generates 
the $U, V, \delta, \epsilon$ according to the specifications given in the problem. 
If $U_i\alpha + \delta_i > 0$, she sets $X_i = 1$ and $Y_i = Y_{i,X_i} = c + V_i\beta + \sigma\epsilon_i$: 
subject $i$ is coached, and scores $Y_i$ on the SAT. If $U_i\alpha + \delta_i < 0$, 
Nature sets $X_i = 0$ and $Y_i = Y_{i,X_i} = V_i\beta + \sigma\epsilon_i$: subject $i$ is not 
coached, and scores $Y_i$ on the SAT. The two versions of $Y_i$ differ by 
$c$, the effect of coaching. (You only get to see one version.) 

The concern is self-selection. If the smart kids choose coaching, and 
we just fit a response equation, we will over-estimate the effect of 
coaching. The assignment equation (if it’s right) helps us adjust for 
self-selection. The parameter $\rho$ captures the dependence between 
$X_i$ and $\epsilon_i$. This is just like Evans and Schwab, except that the 
outcome variable (SAT score) is continuous. 

In the selection equation, the scale of the latent variable is not 
identifiable, so Powers and Rock set it to 1 (section 7.2). In the response 
equation, there is a scale parameter $\sigma$. 
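
The self-selection concern can be illustrated by simulation. The sketch below is not Powers and Rock’s analysis: it assumes scalar covariates $U_i$ and $V_i$ that are independent standard normal, standard normal errors $\delta_i$ and $\epsilon_i$ with correlation $\rho$, and made-up parameter values. It compares the naive difference in average scores (coached minus uncoached) with the true effect $c$.

```python
import math
import random

def naive_coaching_effect(c, rho, n=100_000, alpha=1.0, beta=1.0, sigma=1.0, seed=0):
    """Difference in mean Y between coached and uncoached subjects.

    All parameter values are hypothetical; U and V are scalar here.
    """
    rng = random.Random(seed)
    sums = {0: 0.0, 1: 0.0}
    counts = {0: 0, 1: 0}
    for _ in range(n):
        u = rng.gauss(0, 1)
        v = rng.gauss(0, 1)
        delta = rng.gauss(0, 1)
        # eps has variance 1 and correlation rho with delta
        eps = rho * delta + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        x = 1 if u * alpha + delta > 0 else 0      # selection equation
        y = c * x + v * beta + sigma * eps          # response equation
        sums[x] += y
        counts[x] += 1
    return sums[1] / counts[1] - sums[0] / counts[0]

c = 0.2
with_selection = naive_coaching_effect(c, rho=0.5)   # errors correlated
no_selection = naive_coaching_effect(c, rho=0.0)     # errors independent
print(round(with_selection, 2), round(no_selection, 2))
```

With positive correlation between the two error terms, the naive comparison overstates $c$; with zero correlation, it is close to $c$.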


Comment. Powers and Rock show, without any adjustment at all, that the 
effect of coaching is small. Their tables suggest that confounding makes the 
unadjusted effect an over-estimate. On the whole, the paper is persuasive as 
well as interesting. 


7. (a) A dummy variable is 0 or 1 (section 6.6): D1992 is 1 for observations 
in 1992 and 0 for other observations; it’s there in case 1992 was 
special in some way. 

(b) No. You have to take the interactions into account. If the Republicans 
buy 500 GRPs in year $t$ and state $i$, then Rep.TV goes up by 5, 
and their share of the vote should go up by 


5 × [0.430 + 0.066 × (Rep.AP − Dem.AP) + 0.032 × UN + 0.006 × RS], 


where Rep.AP is evaluated in year $t$ and state $i$, and likewise for 
the other variables. 


Comment. All other factors are held constant, and we’ve suspended disbelief 
in the model. Shaw (1999, p. 352) interprets the coefficient 0.430 as meaning 
that a 500 GRP buy of TV time yields a 2.2 percentage point increase in votes. 
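
A small sketch of the calculation in 7(b), taking the model at face value. The coefficients are the ones displayed above; the function name and the covariate values in the second call are hypothetical, chosen only to show how the interactions change the answer.

```python
def vote_share_gain(grps, ap_diff, un, rs):
    """Predicted gain in Republican vote share, in percentage points.

    ap_diff stands for Rep.AP - Dem.AP; un and rs stand for UN and RS.
    """
    rep_tv_change = grps / 100  # 500 GRPs move Rep.TV by 5
    per_unit = 0.430 + 0.066 * ap_diff + 0.032 * un + 0.006 * rs
    return rep_tv_change * per_unit

# Interaction terms all zero: only the 0.430 coefficient matters.
main_only = vote_share_gain(500, ap_diff=0, un=0, rs=0)

# Hypothetical covariate values: the interactions now contribute too.
with_interactions = vote_share_gain(500, ap_diff=2, un=10, rs=20)

print(round(main_only, 2), round(with_interactions, 2))
```

The first number is 5 × 0.430 = 2.15, which is where the 2.2 percentage points in the comment comes from.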


8. Use logistic regression, not OLS, because the response variable is binary. 
For the $i$th subject, let $Y_i = 1$ if that subject experienced a heart attack 
during the study period, else $Y_i = 0$. The sample size is 


n = 6,224 + 27,034 = 33,258. 


The number of variables is $p = 8$ because there is an intercept, a treatment 
variable, and six covariates. The design matrix $X$ is 33,258 × 8. 
Its $i$th row $X_i$ is 


$[1 \;\; HRT_i \;\; AGE_i \;\; HEIGHT_i \;\; WEIGHT_i \;\; CIGS_i \;\; HYPER_i \;\; HICHOL_i]$ 

where 


$HRT_i = 1$ if subject $i$ was on HRT, else $HRT_i = 0$, 

$AGE_i$ is subject $i$’s age, 

$HEIGHT_i$ is subject $i$’s height, 

$WEIGHT_i$ is subject $i$’s weight, 

$CIGS_i = 1$ if subject $i$ was a smoker, else $CIGS_i = 0$, 

$HYPER_i = 1$ if subject $i$ had hypertension, else $HYPER_i = 0$, 

$HICHOL_i = 1$ if subject $i$ had high cholesterol levels, else $HICHOL_i = 0$. 


The statistical model says that given the $X$’s, the $Y$’s are independent, 
and 

$$\log \frac{P\{Y_i = 1|X\}}{1 - P\{Y_i = 1|X\}} = X_i\beta.$$


The crucial parameter is $\beta_2$, the HRT coefficient. The investigators want 
$\beta_2$ to be negative (HRT reduces the risk) and statistically significant. The 
SE would be estimated from the observed information. Then a $t$-test 
would be made. 


The model is needed to control for confounding. For causal inference, 
we also want a response schedule and an exogeneity assumption. All the 
usual questions are left open. Why these variables and that functional 
form? Why are the coefficients constant across subjects? And so forth. 


Comment. Salient missing variables are measures of status, like income. 
Moreover, it could well be the more health-conscious women who are tak- 
ing HRT, which requires medical supervision. In this example, experimental 
evidence showed the observational data to be misleading (see the end notes 
for the chapter). 


9. In an experiment, the investigator assigns the subjects to treatment or 
control. In an observational study, the subjects assign themselves (or are 
assigned by some third party). The big problem is confounding. Possible 
solutions include stratification and modeling. See text for discussion and 
examples. 


10. The fraction of successes in the treatment group is an unbiased estimate 
of $\alpha^T$. The fraction of successes in the control group is an unbiased 
estimate of $\alpha^C$. The difference is an unbiased estimate of $\alpha^T - \alpha^C$. 


11. Each model assumes linear additive effects on its own scale: look at the 
formulas. Randomization justifies neither model. Why would it justify 
one rather than the other, to say nothing of all the remaining possibilities? 
Just for example, treatment might help women, not men. Neither model 
allows for this possibility. Of course you can (and probably should) 
analyze the data without anything like logits or probits: see chapter 1, 
especially tables 1 and 2. Also see Freedman (2006b, 2008a, 2008c). 


12. Not a good idea. Here, one child’s outcome may well depend on 
neighboring children’s assignments. (Malaria is an infectious disease, 
transmitted by the Anopheles mosquito.) 


13. Looks good so far. 


14. Stratification is probably a better way to go: fewer assumptions. On the 
other hand, the groups might be heterogeneous. With more covariates 
used to define smaller groups, you may run out of data. Finally, with 
stratification, there’s no way to estimate what would happen with other 
values of covariates. 


15. Maximum likelihood is a large-sample technique. With 400 observations, 
she’d be fine. With 4, the advice is, think again. 


16. The quote might be a little mixed up. White’s correction is a way 
of taking heteroscedasticity into account when computing standard errors 
for OLS (end notes to chapter 5). The relevance to the MLE is 
not obvious. The $Y_{it}$ will be heteroscedastic given the $X$’s, because 
$\mathrm{var}(Y_{it}|X) = P(Y_{it} = 0|X) \times [1 - P(Y_{it} = 0|X)]$ will depend on 
$i$ and $t$. If the model is right (that’s a whole other issue), the MLE 
automatically accounts for differences in $P(Y_{it} = 0|X)$ across $i$ and $t$. 


Comment. The investigators might have been thinking of Huber’s “sandwich 
estimator” for the standard error, which is robust against certain kinds of 
misspecification, although the MLE may then be quite biased. See 
Freedman (2006a). 


17. Sounds like non-identifiability. 


18. Even if the model is right, and $c > 0$, the combined effect of left-wing 
power in country $i$ and year $t$ is 


$$a \times LPP_{it} + b \times TUP_{it} + c \times LPP_{it} \times TUP_{it}, \qquad (*)$$


which can be negative. It all depends on the size of $a$, $b$, $c$ and $LPP_{it}$, 
$TUP_{it}$. Maybe the right wing has a point after all. 


Comments. (i) With Garrett’s model, the combined effect (*) of left-wing 
power was to reduce growth rates for most years in most countries, contrary 
to his opinion. (ii) The $\epsilon_{it}$ are random errors, with mean 0; apparently, Garrett 
took these errors to be IID in time, but allowed covariance across countries. 
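
To see how (*) can be negative even when $c > 0$, here is a sketch with made-up coefficients. The numbers are hypothetical and are not Garrett’s estimates; the function name is ours.

```python
def combined_effect(a, b, c, lpp, tup):
    """Combined effect (*): a*LPP + b*TUP + c*LPP*TUP."""
    return a * lpp + b * tup + c * lpp * tup

# Hypothetical coefficients with a positive interaction term c.
a, b, c = -2.0, -1.0, 0.5

low = combined_effect(a, b, c, lpp=1.0, tup=1.0)     # -2 - 1 + 0.5 = -2.5
high = combined_effect(a, b, c, lpp=10.0, tup=10.0)  # -20 - 10 + 50 = 20.0
print(low, high)
```

The sign depends on where $LPP_{it}$ and $TUP_{it}$ sit relative to the coefficients, which is the point of the answer.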


19. In this exercise, LPP, TUP, and the interaction don’t matter: they are 
folded into $Z$. To create Garrett’s design matrix $M$, which is 350 × 24, 
stack the data as in exercise 5C3, with 25 observations on country #1 
(ordered by year) at the top, then the observations for country #2, .... 
The first 14 columns of $M$ are the country dummies; $\alpha_i$ is the coefficient 
of the dummy variable for country $i$. Take $L$ to be a 24 × 24 matrix with 
1’s along the main diagonal; the first 14 entries in the first column are 
also 1; all other entries are 0. You should check that $ML$ gives Beck’s 
design matrix. Now 


$$[(ML)'(ML)]^{-1}(ML)'Y = [L'(M'M)L]^{-1}L'M'Y$$
$$= L^{-1}(M'M)^{-1}(L')^{-1}L'M'Y$$
$$= L^{-1}(M'M)^{-1}M'Y.$$


If $\beta$ is Garrett’s parameter vector and $\beta^*$ is Beck’s, then $\hat\beta^* = L^{-1}\hat\beta$, so 
$\hat\beta = L\hat\beta^*$. (A more direct argument is possible too.) 


20. In 1999, statisticians placed less reliance on the normal law of error than 
they did in 1899. (What will things look like in 2099?) Yule is playing 
a little trick on Sir Robert. If the OLS model holds, OLS estimates are 
unbiased. But why does the model hold? Choosing models is a rather 
subjective business that goes well beyond the data, especially when 
causation gets into the picture. 


Chapter 8 The Bootstrap 
Exercise Set A 


1. True. 


2. These statements are all true, illustrating the idea of the bootstrap. (Might 
be even better, e.g., to divide by 99 not 100, but we’re not going to be 
fussy about details like that.) 


3. (a) $\sigma/\sqrt{5000}$. (b) $\sqrt{V}/\sqrt{100}$. (c) $\sqrt{V}$. 


Reason for (b): $X_{\mathrm{ave}}$ is the average of 100 IID $X_{(k)}$’s whose sample 
variance is $V$. In (c), there is no need for “around.” 
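
The idea behind exercises 2 and 3 can be sketched in a few lines: resample the data with replacement, recompute the average each time, and use the SD of the bootstrap averages as the standard error. This is a minimal illustration with simulated data (50 draws from a normal distribution, our choice), not the setup of any particular exercise.

```python
import math
import random
import statistics

def bootstrap_se_of_mean(data, n_boot=2000, seed=0):
    """Bootstrap SE of the sample average."""
    rng = random.Random(seed)
    n = len(data)
    boot_means = []
    for _ in range(n_boot):
        # Draw n items at random with replacement from the data.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        boot_means.append(sum(resample) / n)
    # stdev divides by n_boot - 1 rather than n_boot; the difference is minor.
    return statistics.stdev(boot_means)

rng = random.Random(1)
data = [rng.gauss(10, 2) for _ in range(50)]  # illustrative sample

se_boot = bootstrap_se_of_mean(data)
se_formula = statistics.pstdev(data) / math.sqrt(len(data))
print(round(se_boot, 3), round(se_formula, 3))
```

For the average, the bootstrap SE should be close to the SD of the data divided by the square root of the sample size; the bootstrap earns its keep with statistics that have no such formula.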


Exercise Set B, Chapter 8 


1. Choose (ii). See text. 

2. The parameters are the 10 regional intercepts $a_j$ and the five coefficients 
$b, c, d, e, f$. These are unobservable. So are the random errors $\delta_{t,j}$. 
Otherwise, everything is observable. 

3. The disturbance term for 1975 could have been different, and then energy 
consumption would have been different. 

4. 0.281 is the one-step GLS estimate for the parameter $f$. 

5. One-step GLS is biased, for estimating d, e, f: compare columns (A) 
and (C). The bias in f, for instance, is highly significant, and comparable 
in size to the SE for f. Not trivial. 

6. Biased. Compare columns (D) and (E): see text. 

7. Biased, although not as badly as the plug-in SEs. Compare columns (D) 
and (F): see text. 

8. The bootstrap is not reliable with such a small sample. With 40 obser- 
vations, maybe. But with 4?!? Maybe Paula needs another idea. 

9. $\epsilon_i = Y_i - a - bY_{i-1}$ for all $i$. If $1 \le i < n$, then $Y_i = X_{i+1,2}$ and 
$Y_{i-1} = X_{i,2}$ can be computed from $X$. So, $\epsilon_n$ is independent of $X$, but the 
earlier $\epsilon_i$ are completely dependent on $X$. 


Chapter 9 Simultaneous Equations 
Exercise Set A 


1. $a_1$ should be positive. Supply increases with price. By contrast, $a_2$ and 
$a_3$ should be negative. When the price of labor and materials goes up, 
supply goes down. 


2. $b_1$ should be negative. Demand goes down as price goes up. Next, $b_2$ 
should be negative. Demand goes down as the price of complements 
goes up. You’re not going to spread butter on that ten-dollar piece of 
bread, because you’re not going to eat that piece of bread in the first 
place. Finally, $b_3$ should be positive. Demand goes up as the price 
of substitutes goes up. When olive oil costs $50 an ounce, you throw 
caution to the winds and eat butter. 


3. The law of supply and demand is built into the model as an assumption: 
$Q_t$ and $P_t$ are the market-clearing quantity and price. We got them by 
solving the supply and demand equations in year $t$, i.e., by finding the 
point where the supply curve crosses the demand curve. See figure 1, 
and equations (2)-(3). 


The supply curve is concave but not strictly concave. (It’s linear.) The 
demand curve is convex but not strictly convex. (It’s linear too.) For 
this reason among others, economists prefer log linear equations, like 


$$\log Q = a_0 + a_1 \log P + a_2 \log W + a_3 \log H + \delta_t,$$
$$\log Q = b_0 + b_1 \log P + b_2 \log F + b_3 \log O + \epsilon_t.$$


4. Equation (2a) is the relevant one: (2b) says how consumers would respond. 
The reduced-form equations (3a) and (3b) say how quantity 
and price would respond if we manipulated the exogenous variables 
$W_t, H_t, F_t, O_t$. Notice that $P_t$ does not appear on the right hand side of 
(3a); and $Q_t$ does not appear on the right hand side of (3b). 


Exercise Set B, Chapter 9 


1. For part (a), let $c$ be $p \times 1$. Then $c'Z'Zc = \|Zc\|^2 \ge 0$. If $c'Z'Zc = 0$, 
then $Zc = 0$ and $Z'Zc = 0$, so $c = 0$ because $Z'Z$ has full rank (this 
was given). Thus, $Z'Z$ is positive definite. The rest of part (a) follows 
from exercise 3D7. For (b), the matrix $(Z'Z)^{-1}$ is positive definite, so 


$$c'X'Z(Z'Z)^{-1}Z'Xc \ge 0.$$

Equality entails $Z'Xc = 0$, hence $c = 0$, because $Z'X$ has full rank 
(this was given). Thus, $X'Z(Z'Z)^{-1}Z'X$ is positive definite, hence 
invertible (exercise 3D7). 

2. (a) and (b) are true; (c), (d), and (e) are false. 
Comment. With a large sample, the sample mean will be nearly the same as 
$E(U_i)$, and the sample variance will be nearly the same as $\mathrm{var}(U_i)$. But the 
concepts are different, and with small or medium-sized samples, so are the 
numbers (section 2.4). 


Exercise Set C, Chapter 9 


1. Don’t do that without further thought. According to the model, price 
and quantity are endogenous. You might want to fit by OLS even so 
(section 9.8), but you have to consider endogeneity bias. 


2. Statements (a)-(b)-(c) are all false, unless there is some miracle of can- 
cellation. The OLS residuals are orthogonal to X, but IVLS isn’t OLS. 
Statement (d) is true by definition (14). 


3. OLS always gives a better fit: see exercise 3B14(j). You do IVLS only 
if there’s a model you believe in, you want to estimate the parameters in 
that model, and are concerned about endogeneity bias. 


(: OLS may be ordinary, but it makes the least of the squares :) 


4. Biased. IVLS isn’t real GLS. We’re pretending that Z’X is constant. 
But that isn’t right, at least, not exactly. As the sample size grows, the 
bias will (with any luck) get small. 


5. The $p \times p$ matrix $Z'X$ has full rank, by assumption (ii) in section 9.2. 
Hence, $Z'X$ is invertible. By (10), 


$$\hat\beta_{IVLS} = [X'Z(Z'Z)^{-1}Z'X]^{-1}X'Z(Z'Z)^{-1}Z'Y$$
$$= (Z'X)^{-1}(Z'Z)(X'Z)^{-1}X'Z(Z'Z)^{-1}Z'Y$$
$$= (Z'X)^{-1}Z'Y.$$

Watch it. This only works when $q = p$. Otherwise, $Z'X$ isn’t square. 

6. From (10), 


$$\hat\beta_{IVLS} = [X'Z(Z'Z)^{-1}Z'X]^{-1}X'Z(Z'Z)^{-1}Z'Y = QY,$$


where 

$$Q = [X'Z(Z'Z)^{-1}Z'X]^{-1}X'Z(Z'Z)^{-1}Z'.$$

Now $QX = I_{p \times p}$ and $Y = X\beta + \delta$, so $QY = \beta + Q\delta$. Since $Q$ is taken 
as constant (rather than random), 


$$\mathrm{cov}\{\hat\beta_{IVLS}|Z\} = \sigma^2 Q I_{n \times n} Q' = \sigma^2 QQ' = \sigma^2 [X'Z(Z'Z)^{-1}Z'X]^{-1}.$$


Evaluating $QQ'$ is straightforward but tedious. 


Comments. (i) This exercise is only intended to motivate equation (13), which 
defines $\widehat{\mathrm{cov}}(\hat\beta_{IVLS}|Z)$. (ii) What really justifies the definition is theorem 1 in 
section 9.8. (iii) If you want to make exercise 6 more mathematical, suppose 
$X$ happens to be exogenous (so IVLS is an unnecessary trip); condition on $X$ 
and $Z$. 


Exercise Set D, Chapter 9 


1. 0.128 — 0.042 — 0.0003 x 300 + 0.092 + 0.005 x 11 
+0.015 x 12 — 0.046 + 0.277 + 0.041 + 0.336 = 0.931. 


2. — 0.042 — 0.0003 x 300 + 0.092 + 0.005 x 11 
+0.015 x 12 — 0.046 + 0.277 + 0.041 + 0.336 = 0.803. 


Comment. In exercises 1 and 2, the parents live in district 1, so the universal- 
choice dummy is 0: its coefficient (—0.035) does not come into the calcula- 
tion. Frequency of church attendance is measured on a scale from 1 to 7, with 
“never” coded as 1. The 0.931 is indeed too close to 1.00 for comfort. . . . 
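
The sums in exercises 1 and 2 are easy to verify. In the sketch below, the terms are copied from the two displays above; the text identifies only some of them (the 0.128 is the school-choice term, and 300 is school size), so the others are left unlabeled.

```python
# Terms common to exercises 1 and 2, copied from the displayed sums.
common_terms = [
    -0.042,
    -0.0003 * 300,   # school size is 300
    0.092,
    0.005 * 11,
    0.015 * 12,
    -0.046,
    0.277,
    0.041,
    0.336,
]

choice_effect = 0.128  # the "effect" of school choice (exercise 3)

without_choice = sum(common_terms)               # exercise 2
with_choice = choice_effect + without_choice     # exercise 1

print(round(with_choice, 3), round(without_choice, 3))
```

The two results differ by exactly the school-choice coefficient, which is why the answer to exercise 3 is 0.128.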


3. The difference is 0.128. This is the “effect” of school choice. 
4. estimated expected probabilities. We’re substituting estimates for 
parameters in (23), and replacing the latent variable $V_i$ by its expected 
value, 0. 


5. School size is a much bigger number than other numbers in the equation. 
For example, −0.3 × 300 = −90. If the coefficient was −0.3, we’d be 
seeing a lot of negative probabilities. 


6. No. The left hand side variable has to be a probability, not a 0-1 vari- 
able. Equation (2) in Schneider et al is about estimation, not modeling 
assumptions. 


7. Some of the numbers line up between the sample and the population, 
but there are real discrepancies, e.g., on the educational level of parents 
in District 4. In the sample, 65% have a high school education 
or better, compared to 48% in the population. (The SE on the 65% is 
something like $100\% \times \sqrt{0.48 \times 0.52/333} \approx 3\%$: this isn’t a chance 
effect.) Schneider et al collected income data but elected not to use it. 
Why not? The intervention is left unclear in the paper, as is the model. 
The focus is on estimation technique. 


Exercise Set E, Chapter 9 

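1. Go with investigator #3, who is doing IVLS: exercise C5. Investigator 
#1 is doing OLS, which is biased. Investigator #2 is a little mixed up. 
To pursue that, we need some notation for the covariance matrix of 
$X_i, Z_i, \epsilon_i, Y_i$. This is a 4 × 4 matrix. The top left 3 × 3 corner in (*) 
shows the notation and assumptions. For example, $\sigma^2$ is used to denote 
$\mathrm{var}(\epsilon_i)$, $\psi$ to denote $\mathrm{cov}(X_i, Z_i)$, and $\theta$ to denote $\mathrm{cov}(X_i, \epsilon_i)$. Since $Z_i$ 
is exogenous, $\mathrm{cov}(Z_i, \epsilon_i) = 0$. The last row (or column) is derived by 
math. For instance, $\mathrm{var}(Y_i) = \beta^2 \mathrm{var}(X_i) + \mathrm{var}(\epsilon_i) + 2\beta\,\mathrm{cov}(X_i, \epsilon_i) = 
\beta^2 + \sigma^2 + 2\beta\theta$. 


$$\begin{array}{c|cccc}
 & X_i & Z_i & \epsilon_i & Y_i \\ \hline
X_i & 1 & \psi & \theta & \beta + \theta \\
Z_i & \psi & 1 & 0 & \beta\psi \\
\epsilon_i & \theta & 0 & \sigma^2 & \sigma^2 + \beta\theta \\
Y_i & \beta + \theta & \beta\psi & \sigma^2 + \beta\theta & \beta^2 + \sigma^2 + 2\beta\theta
\end{array} \qquad (*)$$


For investigator #2, the design matrix $M$ has a column of $X$’s and a 
column of $Z$’s, so 


$$M'M/n \approx \begin{pmatrix} 1 & \psi \\ \psi & 1 \end{pmatrix}, \qquad M'Y/n \approx \begin{pmatrix} \beta + \theta \\ \beta\psi \end{pmatrix},$$

$$\begin{pmatrix} 1 & \psi \\ \psi & 1 \end{pmatrix}^{-1} = \frac{1}{1 - \psi^2} \begin{pmatrix} 1 & -\psi \\ -\psi & 1 \end{pmatrix},$$

$$(M'M)^{-1}M'Y \approx \begin{pmatrix} \beta + \theta/(1 - \psi^2) \\ -\psi\theta/(1 - \psi^2) \end{pmatrix}.$$


When $n$ is large, the estimator for $\beta$ suggested by investigator #2 is biased 
by $\theta/(1 - \psi^2)$. A much easier calculation shows the OLS estimator is 
biased by $\theta$. For the asymptotics of the IVLS estimator, see example 1 
in section 8. 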
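
The three large-sample limits can be checked by simulation. The sketch below assumes the covariance structure in (*) with hypothetical values $\beta = 2$, $\theta = 0.5$, $\psi = 0.5$, $\sigma^2 = 1$; the way the variables are built from independent normals is our construction, chosen so the covariances come out as specified.

```python
import math
import random

def three_estimators(beta=2.0, theta=0.5, psi=0.5, sigma2=1.0, n=200_000, seed=0):
    """Return (OLS of Y on X, coefficient of X when Y is regressed on X and Z, IVLS)."""
    rng = random.Random(seed)
    root = math.sqrt(1 - psi ** 2)
    # Leftover SD so that var(eps) = sigma2; needs sigma2 >= theta^2 / (1 - psi^2).
    s = math.sqrt(sigma2 - theta ** 2 / (1 - psi ** 2))
    sxx = sxz = szz = sxy = szy = 0.0
    for _ in range(n):
        z = rng.gauss(0, 1)
        w = rng.gauss(0, 1)
        x = psi * z + root * w                           # var(X) = 1, cov(X, Z) = psi
        eps = (theta / root) * w + s * rng.gauss(0, 1)   # cov(X, eps) = theta, cov(Z, eps) = 0
        y = beta * x + eps
        sxx += x * x
        sxz += x * z
        szz += z * z
        sxy += x * y
        szy += z * y
    ols = sxy / sxx                                      # investigator #1
    det = sxx * szz - sxz ** 2
    two_variable = (szz * sxy - sxz * szy) / det         # investigator #2
    ivls = szy / sxz                                     # investigator #3, q = p = 1
    return ols, two_variable, ivls

ols, two_variable, ivls = three_estimators()
print(round(ols, 2), round(two_variable, 2), round(ivls, 2))
```

With these values, OLS should land near $\beta + \theta = 2.5$, investigator #2 near $\beta + \theta/(1 - \psi^2) \approx 2.67$, and IVLS near $\beta = 2$.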

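2. The correlation between $Z$ and $\epsilon$ is not identifiable, so $Z$ cannot be used 
as an instrument. Here are some details. The basic thing is the joint 
distribution of $X_i, Z_i, \epsilon_i$. (These are jointly normal random variables, 
mean 0, and IID as triplets.) The joint distribution is specified by its 3 × 3 
covariance matrix. In that matrix, $\mathrm{var}(X_i)$, $\mathrm{var}(Z_i)$ and $\mathrm{cov}(X_i, Z_i)$ are 
almost determined by the data ($n$ is large). Let’s take them as known. 
For simplicity, let’s take $\mathrm{var}(X_i) = \mathrm{var}(Z_i) = 1$ and $\mathrm{cov}(X_i, Z_i) = \frac{1}{2}$. 
There are three remaining parameters in the joint distribution of 
$X_i, Z_i, \epsilon_i$: 

$$\mathrm{cov}(X_i, \epsilon_i) = \theta, \quad \mathrm{cov}(Z_i, \epsilon_i) = \phi, \quad \mathrm{var}(\epsilon_i) = \sigma^2.$$


So the covariance matrix of $X_i, Z_i, \epsilon_i$ is 


$$\begin{array}{c|ccc}
 & X_i & Z_i & \epsilon_i \\ \hline
X_i & 1 & \frac{1}{2} & \theta \\
Z_i & \frac{1}{2} & 1 & \phi \\
\epsilon_i & \theta & \phi & \sigma^2
\end{array} \qquad (\dagger)$$


The other random variable in the system is $Y_i$, which is constructed from 
$X_i, Z_i, \epsilon_i$ and another parameter $\beta$: $Y_i = \beta X_i + \epsilon_i$. We can now make 
a complete list of the parameters: 

(i) $\mathrm{cov}(X_i, \epsilon_i) = \theta$, 

(ii) $\mathrm{cov}(Z_i, \epsilon_i) = \phi$, 

(iii) $\mathrm{var}(\epsilon_i) = \sigma^2$, 

(iv) $\beta$. 

The random variable $\epsilon_i$ is not observable. The observables are $X_i, Z_i, Y_i$. 
The joint distribution of $X_i, Z_i, Y_i$ determines (and is determined by) 
its 3 × 3 covariance matrix (theorem 3.2). This matrix can be computed 
from the four parameters: 


$$\begin{array}{c|ccc}
 & X_i & Z_i & Y_i \\ \hline
X_i & 1 & \frac{1}{2} & \beta + \theta \\
Z_i & \frac{1}{2} & 1 & \frac{1}{2}\beta + \phi \\
Y_i & \beta + \theta & \frac{1}{2}\beta + \phi & \beta^2 + \sigma^2 + 2\beta\theta
\end{array} \qquad (\ddagger)$$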


For example, the 2,3 element in the matrix (repeated as the 3,2 element) 
is supposed to be $\mathrm{cov}(Y_i, Z_i)$. Let’s check. We’re given that $E(X_i) = 
E(Z_i) = E(Y_i) = E(\epsilon_i) = 0$. So 


$$\mathrm{cov}(Y_i, Z_i) = E(Y_iZ_i) = E[(\beta X_i + \epsilon_i)Z_i],$$

which is 


$$\beta E(X_iZ_i) + E(Z_i\epsilon_i) = \beta\,\mathrm{cov}(X_i, Z_i) + \mathrm{cov}(Z_i, \epsilon_i) = \tfrac{1}{2}\beta + \phi.$$

The joint distribution of $X_i, Z_i, Y_i$ determines (and is determined by) 
the following three things: 

(a) $\beta + \theta$, 

(b) $\frac{1}{2}\beta + \phi$, 

(c) $\beta^2 + \sigma^2 + 2\beta\theta$. 

That’s all you need to fill out the matrix ($\ddagger$), and that’s all you can get 
out of the data on $X_i, Z_i, Y_i$, no matter how large $n$ is. There are three 
knowns: (a)-(b)-(c). There are four unknowns $\theta$, $\phi$, $\sigma^2$, $\beta$. Blatant 
non-identifiability. 

To illustrate, let’s start with the parameter values shown in column #2 
of the following table. 


    1             2       3
    $\theta$      1/2     3/2
    $\phi$        0       1/2
    $\sigma^2$    1       3
    $\beta$       2       1


Then (a) $\beta + \theta = 2.5$, (b) $\frac{1}{2}\beta + \phi = 1.0$, and (c) $\beta^2 + \sigma^2 + 2\beta\theta = 7.0$. 


Now, increase $\phi$ to $\frac{1}{2}$, as shown in column #3. Choose a new value for 
$\beta$ so (b) doesn’t change, a new $\theta$ so (a) doesn’t change, and $\sigma^2$ so 
(c) doesn’t change. The new values are shown in column #3 of the 
table. Both columns lead to the same numbers for (a), (b), (c), hence the 
same joint distribution for $X_i, Z_i, Y_i$. That already demonstrates 
non-identifiability, and there are many other possible choices. With column 
#2, $Z$ is exogenous: $\mathrm{cov}(Z_i, \epsilon_i) = \phi = 0$. With column #3, $Z$ is 
endogenous: $\mathrm{cov}(Z_i, \epsilon_i) \ne 0$. Exogeneity cannot be determined from 
the joint distribution of the observables. That is the whole trouble with 
the exogeneity assumption. 
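
The claim that columns #2 and #3 give the same observable moments can be verified exactly. The sketch below uses exact fractions and the three expressions (a), (b), (c) from the answer; the function name is ours.

```python
from fractions import Fraction as F

def observable_moments(theta, phi, sigma2, beta, cov_xz=F(1, 2)):
    """The three quantities (a), (b), (c) that the data can determine."""
    a = beta + theta                              # cov(X, Y)
    b = cov_xz * beta + phi                       # cov(Z, Y)
    c = beta ** 2 + sigma2 + 2 * beta * theta     # var(Y)
    return a, b, c

# Column #2 of the table: Z is exogenous (phi = 0).
col2 = observable_moments(theta=F(1, 2), phi=F(0), sigma2=F(1), beta=F(2))

# Column #3 of the table: Z is endogenous (phi = 1/2).
col3 = observable_moments(theta=F(3, 2), phi=F(1, 2), sigma2=F(3), beta=F(1))

print(col2 == col3, [float(m) for m in col2])
```

Two different parameter vectors, one with $\phi = 0$ and one with $\phi \ne 0$, produce identical values for (a), (b), (c), so no amount of data on the observables can tell them apart.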


Comments. (i) This exercise is similar to the previous one. In that exercise, 
$\mathrm{cov}(Z_i, \epsilon_i) = 0$ because $Z_i$ was given as exogenous; here, $\mathrm{cov}(Z_i, \epsilon_i) = \phi$ 
is an important parameter because $Z_i$ is likely to be endogenous. There, 
$\mathrm{cov}(X_i, Z_i) = \psi$ was a free parameter; here, we chose $\psi = \frac{1}{2}$ (for no 
particular reason). There, we displayed the 4 × 4 covariance matrix of $X_i, Z_i, \epsilon_i, Y_i$. 
Here, we display two 3 × 3 covariance matrices. If you take $\phi = 0$ and $\psi = \frac{1}{2}$, 
the matrices (*), ($\dagger$), ($\ddagger$) will all line up. 

(ii) For a similar example in a discrete choice model, see 
http://www.stat.berkeley.edu/users/census/socident.pdf 


(iii) There is a lot of econometric theorizing about instrumental variables. 
What it boils down to is this. If you are willing to assume that some variables 
are exogenous, you can test the exogeneity of others. 


3. This procedure is inconsistent: it gives the wrong answer no matter how 
much data you have. This is because you’re estimating $\sigma^2$ with only 
$q - p$ degrees of freedom. 


Discussion. In principle, you can work everything out for the following 
model, which has $q = 2$ and $p = 1$. Let $(U_i, V_i, \delta_i, \epsilon_i)$ be IID in $i$. The 
four-tuple $(U_i, V_i, \delta_i, \epsilon_i)$ is jointly normal. Each variable has mean 0 and 
variance 1. Although $U_i$, $V_i$, and $(\delta_i, \epsilon_i)$ are independent, $E(\delta_i\epsilon_i) = \rho \ne 0$. 
Let $X_i = U_i + V_i + \epsilon_i$ and $Y_i = X_i\beta + \delta_i$. The unknown parameters are 
$\rho$ and $\beta$. The observables are $U_i, V_i, X_i, Y_i$. The endogenous $X_i$ can be 
instrumented by $U_i, V_i$. When $n$ is large, $\hat\beta_{IVLS} \approx \beta$; the residual vector 
from (4) is almost the same as $\delta$. Now you have to work out the limiting 
behavior of the residual vector from (6), and show that it’s pretty random, 
even with huge samples. For detail on a related example with $q = p = 1$, 
see 


http://www.stat.berkeley.edu/users/census/ivls.pdf 


Discussion questions, Chapter 9 


1. Great ad. Perfect example of “lead time bias.” Earlier detection implies 
longer life after detection, because the detection point is moved backwards 
in time. But we want longer life overall. For example, if detection 
techniques improve for an incurable disease, there would be an increase in 
survival after detection, but no increase in lifespan. Useless. 


2. Another great example of lead time bias. For discussion, see Freedman 
(2008b). 


3. Answer. Not a good study either. If it’s a tie overall, and the detection 
rate is higher with dense breasts, it must be lower with non-dense breasts 
(as can be confirmed by looking at the original paper). Moreover, digital 
mammography might be picking up cancers that are not treatable. There 
are significant practical advantages to digital mammography, but this 
study doesn’t make the case. 

4. More numerators without denominators. In how many cases did eyewitness 
testimony lead to righteous convictions? What is the error rate 
for other kinds of evidence? 


5. Suppose a marathon is run over road R in time period T in county C; 
the road is closed for that period. The idea is that if the marathon had 
not been run, there would have been traffic on road R in period T, with 
additional traffic fatalities. Data are available at the county level only. 
Suppose the controls are perfect (a doubtful assumption). Then we know 
what the fatalities would have been in county C in period T, but for the 
marathon. This is bigger than the actual number of fatalities. The study 
attributes the difference to the traffic that would have occurred on road 
R in period T, if the marathon had not been run. 


The logic is flawed. For example, people elsewhere in the county may 
decide not to drive during period T in order to avoid the congestion 
created by the marathon, or they may be forced to drive at low speeds 
due to the congestion, which would reduce traffic fatalities. To be sure, 
there may be arguments to meet such objections. But, on the whole, the 
paper seems optimistic. 


As far as the headline is concerned, why are we comparing running to 
driving? How about a comparison to walking, or reading a book? 


6. (a) The controls are matched to cases within treatment, and average age 
(for instance) depends on treatment. Age data are reported in the 
paper, but the conclusion is pretty obvious from the survival rates 
for the controls. 

(b) See (a). 

(c) Surgeons prefer to operate on relatively healthy patients. If you 
have a serious heart condition, for instance, the surgeon is unlikely 
to recommend surgery. Thus, the cases are generally healthier than 
the age-matched controls. 

(d) No. See (c). This is why randomized controlled experiments are 
needed. 


Comment. This is a very good paper, and the authors’ interpretations of 
the data (which are different from the mistakes naturally made when 
working the exercise) are entirely sensible. The authors also make an 
interesting comparison of intention-to-treat with treatment-received. 


7. Neither formula is good. This is a ratio estimate, 

$$(Y_1 + \cdots + Y_{25})/(X_1 + \cdots + X_{25}),$$

where $X_i$ is the number of registered voters in village $i$ and $Y_i$ is the 
number of votes for Megawati. We’re not counting heads to estimate $p$ 
when a coin is flipped $n$ times, so $p(1 - p)/n$ is irrelevant. 


8. (a) The errors $\epsilon_i$ should be IID with mean 0, and independent of the 
explanatory variables. 

(b) The estimate $\hat b$ should be positive: the parameter $b$ says how much 
happier the older people are, by comparison with the younger ones. 
The estimate $\hat c$ should be positive: $c$ says how much happier the 
married people are, by comparison with the unmarried. The estimate 
$\hat d$ should be positive: a 1% increase in income should lead to 
an increase of $d$ points on the happiness scale. 

(c) Given the linearity assumption, this is not a problem. 

(d) Now we have near-perfect collinearity between the age dummy and 
the marriage dummy, so SEs are likely to be huge. 

(e) The Times is a little confused, and who can blame them? (i) Calculations 
may be rigorous given the modeling assumptions, but where 
do the assumptions come from?? For instance, why should $U_i$ be 
dichotomous, and why cut at 35? Why take the log of income? And 
so forth. (ii) Sophistication of computers and complexity of algorithms 
is no guarantee of anything, except the risk of programming 
error. 


9. The form of the equations, the parameter values, the values of the control 
variables, and the disturbance terms have to be invariant under interventions 
(section 6.4). 


10. Disagree. Random error in a putative cause is liable to bias its coefficient 
toward zero; random error in a confounder works the other way. With 
several putative causes and confounders, the direction of bias is less 
predictable. If measurement error is non-random, almost anything can 
happen. 


11. If (24) is OK, (25) isn’t, and vice versa. Squint at those error terms. For 
example, $\epsilon_{i,t} = \delta_{i,t} - \delta_{i,t-1}$. If the $\delta$’s are IID, the $\epsilon$’s aren’t. Conversely, 
$\delta_{i,t} = \epsilon_{i,t} + \epsilon_{i,t-1} + \cdots$. If the $\epsilon$’s are IID, the $\delta$’s aren’t. 


12. (a) The model is wrong. (b) The third party is suggesting the heterogeneity 
should be modeled. This adds another layer of complexity, probably 
doesn’t come to grips with the issues. 


13. Yeah, right. By the time you’ve tried a few models, the P-values don’t 
mean a thing, and you’re almost guaranteed to find a good-looking (but 
meaningless) model. See section 5.8 and Freedman (2008d). 


14. Put (i) in the first blank and (ii) in the second. 


15. Put (i) in the first blank and (ii) in the second. 


16. 
False: you need the response schedule. 


17. That the explanatory variables are independent of the error term. 


18. Maybe sometimes. For example, if we know the errors are IID, a residual 
plot might refute the linearity assumption. 


19. Maybe sometimes. The statistical assumptions might be testable, up to 
a point, but how would causation get into the picture? Generally, it’s 
going to be a lot harder to prove up the assumptions than to disprove 
them. 


20. This is getting harder and harder. 


21. Oops. The $U_i$ is superfluous to requirements. We should (i) condition 
on the exogenous variables, (ii) assume the $Y_i$ are conditionally 
independent, and (iii) transform either the LHS or the RHS. Here is one 
fix: 


$$\mathrm{prob}(Y_i = 1|G, X) = \Lambda(\alpha + \beta G_i + X_i\gamma), \quad \text{where } \Lambda(x) = e^x/(1 + e^x).$$


An alternative is to formulate the model using latent variables (section 
7.3). The latents $U_i$ should be independent in $i$ with common distribution 
function $\Lambda$, and independent of the $G$’s and $X$’s. Furthermore, 


$$Y_i = 1 \text{ if and only if } \alpha + \beta G_i + X_i\gamma + U_i > 0.$$


But then, drop the “prob.” 


22. The assumption that pairs are independent is built into the log likelihood: 
otherwise, why is a sum relevant? This is a pretty weird assumption, 
especially given that $i$ is common to $(i, j)$ and $(i, k)$. And why Poisson?? 


23. No. Endogeneity bias will usually spread, affecting $\hat a$ and $\hat b$ as well as 
$\hat c$. For step-by-step instructions on how to do this problem and similar 
ones, see 


http://www.stat.berkeley.edu/users/census/biaspred.pdf 


24. Layout of answers matches layout of questions. 


FFFT 
TFFF 
FFFF 
FFT 
FFF 
FTFF 
F 

F 

F 


(a) Subjects are IID because the triples $(X_i, \delta_i, \epsilon_i)$ are IID in $i$. 

(b) Intercepts aren’t needed because all the variables have expectation 0. 

(c) Equation (28) isn’t a good causal model. The equation suggests 
that $W$ is a cause of $Y$. Instead, $Y$ causes $W$. 


Preparing for (d) and (e) 
Write [XY] for lim 1 yo", X:Y;, and so forth. Plainly, 


[XX]=1, [XY]=b, [YY] =b +0?°, [WX]= bc, 
[WY] = c(b? +0?), [WW] = b? + 0o? +1’. 
(d) The asymptotic R’s for (26) and (27) are therefore 


b? m eb + 0”) 
b2? + 02 c2(b? +07) + 1?’ 


respectively. The asymptotic R* for (28) can be computed, with 
patience (see below). But here is a better argument. The R°? for 
(28) has to be bigger than the R? for (27). Indeed, with simple 
regression equations, R? is symmetric. In particular, the R? for 
(27) coincides with the R? for (x): 


Y; = fWi + vi. C) 


But the R? for (28) is bigger than the R? for (*): the extra variable 
helps. So the R? for (28) is bigger than the R? for (27), as claimed. 
Now fix b and t° at any convenient values. Make o? large enough 
to get a small R? for (26). Then make c large enough to get a big 
R? for (27) and hence (28). 


If we fit (28), the product moment matrix divided by n converges 


to 
i Sig? ag? ++ Tr? a 


(e 


wm 


bc 1 
The determinant of this matrix in is c2o0? + t?. The inverse is 
1 1 —bc 
e202 +r? \ —be b??? +. eol +r] 


The limit of the OLS estimator is therefore 


1 1 —bec b?c + co? 
e202 + rt? \ —be b?c? + c207 + T? b i 
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So 

2 br? 

> A > P 
co? 4r? Coz 4e 


> 
9 
q 


® 


Comments. Putting effects on the right hand side of the equation is not 
uncommon, and often leads to spuriously high R*’s. In particular, R? 
does not measure the validity of a causal model. Instead, R? measures 
only strength of association. 


The Computer Labs 


Introduction 


Labs are a key part of the course: the computations illustrate some of the 
main ideas. At Berkeley, labs are set up for MATLAB in a UNIX environment. 
The UNIX prompt is (usually) a percent sign. At the prompt, type matlab.
After a bit, MATLAB will load. Its prompt is >>. If you type edit at 
the prompt, you get a program editor. Changes for WINDOWS are pretty 
straightforward: you can launch MATLAB from the start menu, and get the 
program editor by clicking on an icon in a toolbar. The directory names will 
look different. 

Don’t write MATLAB code or create data files in a word processing 
package like WORD, because formatting is done with a lot of funny characters 
that MATLAB finds indigestible. (You can work around this, but why bother?) 
In UNIX, gedit is a straight-ahead program editor. In WINDOWS, you can 
use notepad or wordpad, although TextPad is a better bet:


http://www.textpad.com


If you type helpdesk at the MATLAB prompt, you get a browser- 
based help facility, with demos and tutorials. If you only want help on a 
particular command, type help at the MATLAB prompt, followed by the 
name of the command. For instance, help load. This works if you know 
the name of the command.... 

MATLAB runs interactively from the command prompt, and you can do 
a lot that way. After a while, you may want to store commands in a text file. 
This sort of file is called a “script file.” Script files make it easier to edit and 
debug code. Script files end with the suffix .m, for instance, demolab.m. 
If you have that file on your system, type demolab at the MATLAB prompt. 
MATLAB will execute all the commands in the file. (There is an annoying 
technicality: the file has to be in your working directory, or on the search path: 
click on File and follow your nose, or type help path at the MATLAB 
prompt, or, if all else fails, look at the documentation.)

A lot of useful MATLAB features are illustrated in demolab.m, including
“function files,” which are special script files needed later in the course.
Some people like to read computer documentation, and MATLAB has pretty
good documentation. Or, you can just sit down at the keyboard and start
fooling around. Some people like to look at code: demolab.m, listed in
an appendix below, is for them.

When you are finished running MATLAB, type exit to end your ses- 
sion, or quit the window. (In UNIX, quitting a window is a much more final 
act than closing it.) Oh, by the way, what happens if your program goes 
berserk and you need to stop it? Just hit control-C: hold down the control- 
key, press C. That will return you to the command prompt. (Be patient, it 
may take a minute for MATLAB to notice the interrupt.) 

Data sets used in the labs, and sample code, are available at 


http://www.stat.berkeley.edu/users/census/data.zip 


Numerics 


Computers generally do “IEEE arithmetic,’ which isn’t exactly arith- 
metic. There is roundoff error. MATLAB is usually accurate to 107!*. It 
seldom does better than $10^{-16}$, although it can. Here is some output:


>> (sqrt(2))^2-2
ans =
  4.4409e-016
>> (sqrt(4))^2-4
ans =
     0

4.4409e-016 is MATLAB’s way of writing $4.4409 \times 10^{-16}$. This is roundoff
error.
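
One way to put the roundoff in the output above on a scale is to compare it with eps, MATLAB's built-in spacing between 1 and the next larger floating point number. The lines below are only a sketch; the variable names err2, err4, and ratio are made up for this illustration.

```matlab
% Roundoff demo: measure a computed error in units of eps.
% eps is built in: the gap between 1 and the next larger double.
err2 = (sqrt(2))^2 - 2;      % not exactly zero: sqrt(2) gets rounded
err4 = (sqrt(4))^2 - 4;      % exactly zero: sqrt(4) = 2 needs no rounding
ratio = abs(err2)/eps;       % size of the error, in units of eps
disp([err2 err4 ratio])
```

If the ratio is a small number, the error is at the level of the arithmetic itself and not a bug in your code.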
Lab 1 
Summary Statistics and Simple Regression 


In this lab, you will calculate some descriptive statistics for Yule’s data 
and do a simple regression. The data are in table 1.3, and in the file 


yule.dat 


You need to subtract 100 from each entry to get the percent change. Refer to 
chapter 1 for more information, or to yuledoc.txt. 


1. Compute the means and SDs of ΔPaup, ΔOut, ΔPop, and ΔOld.

2. Compute all 6 correlations between ΔPaup, ΔOut, ΔPop, and ΔOld.

3. Make a scatter plot of ΔPaup against ΔOut.

4. Run a regression of ΔPaup on ΔOut, i.e., find the slope and intercept of
the regression line. You might also compute the SD of the residuals.


Useful MATLAB commands: load, mean, std, corrcoef,
plot(u,v,'x').
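
The commands listed for Lab 1 fit together roughly as in the sketch below. It uses made-up numbers, not the contents of yule.dat: the vectors u and v and every other name here are hypothetical stand-ins, so the output says nothing about Yule's data.

```matlab
% Sketch for Lab 1 with made-up numbers in place of yule.dat.
% Assumption: u and v are hypothetical columns, not Yule's data.
u = [5 -3 12 0 8 -7 15 2]';           % hypothetical explanatory variable
v = [10 -2 20 1 9 -12 30 4]';         % hypothetical response
mu = mean(u);  mv = mean(v);          % means
su = std(u);   sv = std(v);           % SDs (std divides by n-1)
R  = corrcoef([u v]);                 % 2 x 2 correlation matrix
r  = R(1,2);
slope     = r*sv/su;                  % regression of v on u
intercept = mv - slope*mu;
resid     = v - (intercept + slope*u);
sdResid   = std(resid, 1);            % this version divides by n
% plot(u, v, 'x') would draw the scatter diagram
```

With real data you would get u and v by loading the file and picking off columns, after subtracting 100 as described above.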


Lab 2 
An Exercise with MATLAB 
1. Create a 4 x 3 matrix X and a 4 x 1 vector Y:
1 
Y= 


RWN eR 


1 -1 1 
1 2 3 
4 5 6]? 
7 8 9 


2. Compute $X'X$, $X'Y$, $\det X'X$, rank $X$, rank $X'X$.
3. Compute $(X'X)^{-1}$.
4. Write a single line of MATLAB code to compute

$$\hat\beta = (X'X)^{-1}X'Y.$$

Report $\hat\beta$ as well as the code.


P 238-3) oT 
a=(-1 2 9 =) 
6 3 0 33 


Compute trace AX and trace XA. Comment? 


5. Let 


Useful MATLAB commands: A', A+B, A-B, A*B, det, inv,
rank, trace, size. To create a matrix, type Q=[1 2 3; 4 5 6],
or do it on two lines:

Q=[1 2 3
4 5 6]
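
The commands for Lab 2 can be tried out on small matrices before tackling the ones in the lab. The sketch below uses hypothetical matrices: X, Y, and A here are invented for illustration and are not the matrices specified in the lab.

```matlab
% Sketch for Lab 2 with hypothetical matrices (not the lab's X, Y, A).
X = [1 0 2
     1 1 0
     1 2 1
     1 3 5];                 % 4 x 3, typed on several lines
Y = [2 1 4 3]';              % 4 x 1 column vector
XtX = X'*X;                  % 3 x 3
XtY = X'*Y;                  % 3 x 1
d   = det(XtX);
rk  = [rank(X) rank(XtX)];
betaHat = inv(XtX)*XtY;      % the textbook formula, in one line
betaAlt = X\Y;               % backslash: another route to least squares
A = [1 2 0 1
     0 1 1 0
     2 0 1 3];               % 3 x 4, so A*X is 3 x 3 and X*A is 4 x 4
tr = [trace(A*X) trace(X*A)];
```

Leaving off the semicolon after a line prints the result, which is handy when you are checking a calculation by hand.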

Lab 3
Replicating Yule’s Regression


In this lab, you will replicate Yule’s regression equation for the metropolitan
unions, 1871-81. See chapter 1. Fix the design matrix X at the values reported
in table 1.3. (Subtract 100 from each entry to get the percent changes.)
The data are in the file yule.dat. The file yuledoc.txt gives the variable
names. Yule assumed

$$\Delta\mathrm{Paup}_i = a + b \times \Delta\mathrm{Out}_i + c \times \Delta\mathrm{Old}_i + d \times \Delta\mathrm{Pop}_i + \epsilon_i$$

for 32 metropolitan unions $i$. For now, suppose the errors $\epsilon_i$ are IID, with
mean 0 and variance $\sigma^2$.

1. Estimate $a$, $b$, $c$, $d$, and $\sigma^2$.

2. Compute the SEs.

3. Are these SEs exact, or approximate?

4. Plot the residuals against the fitted values. (This is often a useful
diagnostic: if you see a pattern, something is wrong with the model. You
can also plot residuals against other variables, or time, or. . . .)


Useful MATLAB commands: ones(32,1), [A B].


For bonus points. If you get a different answer from Yule, why might that 
be? 
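
The calculations in Lab 3 follow a standard pattern, sketched below on simulated data. Everything here is a stand-in: D is a hypothetical 32 x 3 matrix playing the role of the percent changes, resp plays the role of the response, and the coefficients used to generate resp are arbitrary.

```matlab
% Sketch for Lab 3: OLS estimates, SEs, residuals, fitted values.
% Assumption: D and resp are simulated stand-ins, not Yule's data.
n = 32;
D = 10*randn(n,3);
resp = 5 + D*[0.7; 0.2; -0.3] + 8*randn(n,1);
X = [ones(n,1) D];                    % design matrix with intercept
p = size(X,2);
betaHat = X\resp;                     % OLS by backslash
fitted  = X*betaHat;
e       = resp - fitted;              % residuals
sigma2Hat = sum(e.^2)/(n-p);          % divide by n-p, not n
covHat  = sigma2Hat*inv(X'*X);        % estimated covariance matrix
SE      = sqrt(diag(covHat));         % one SE per coefficient
% plot(fitted, e, 'x') would draw the residual plot
```

For the lab itself, X would be built from the columns of yule.dat instead of randn.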


Lab 4 

Simulation with MATLAB 
1. Simulate observations on 32 IID normal variables $X_i$ with mean $\mu = 15$
and variance $\sigma^2 = 100$.
2. Calculate the sample mean $\bar X$ and the sample SD $\hat\sigma$ of the data.
3. Repeat 1 and 2, 1000 times.
4. Plot a histogram of the 1000 $\bar X$’s. Comment?
5. Plot a histogram of the 1000 $\hat\sigma$’s. Comment?
6. Plot a scatter diagram of the 1000 pairs $(\bar X, \hat\sigma)$. Comment?
7. Calculate the SD of the 1000 $\bar X$’s. How does this compare to $\sigma/\sqrt{32}$?
Comment?

Useful MATLAB commands: rand, randn, for...end, hist(x,25).


MATLAB loves matrices. It hates loops. Your code should include a couple 
of lines like 


FakeData=randn(32,1000);
Aves=mean(FakeData);


Put in the semicolons, or you will spend a lot of time watching random 
numbers scroll by on the screen. 
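
Here is one way the two lines above might be extended to cover most of Lab 4 without a loop. This is a sketch, not the only way to do it; the names mu, sigma, SDs, sdOfAves, and target are chosen for this illustration.

```matlab
% Sketch for Lab 4: 1000 samples of size 32, one sample per column.
mu = 15;  sigma = 10;                        % variance 100
FakeData = mu + sigma*randn(32,1000);
Aves = mean(FakeData);                       % 1 x 1000 sample means
SDs  = std(FakeData);                        % 1 x 1000 sample SDs
sdOfAves = std(Aves);                        % SD of the 1000 means
target   = sigma/sqrt(32);                   % the comparison in step 7
disp([sdOfAves target])
% hist(Aves,25) and plot(Aves,SDs,'x') would draw the pictures
```

mean and std work column by column on a matrix, which is why no loop is needed.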


Random Numbers 


Appearances notwithstanding, computers have no random elements. 
MATLAB generates “pseudo-random” numbers—numbers which look pretty 
random—by some clever numerical algorithm that is completely determinis- 
tic. One consequence may take you by surprise. With any particular release 
of the program, if you start a MATLAB session and type rand(1), you will
always get the same number. (With Release 13, the answer is 0.9501.) In par- 
ticular, you might get exactly the same results as all the other students in the 
class who are doing Lab 4. (Doesn’t seem random, does it?) A work-around, 
if you care, is to burn some random numbers before doing a simulation—type 
x=rand(abcd,1);, where abcd is the last four digits of your telephone
number. 


Lab 5 
The t-Test. Part 1. 


Yule’s model is described in chapter 1, and in Lab 3. Fix the design 
matrix X at the values reported in table 1.3. (Subtract 100 from each entry to 
get the percent changes.) Suppose the errors $\epsilon_i$ are IID $N(0, \sigma^2)$, where $\sigma^2$
is a parameter (unknown). Make a t-test of the null hypothesis that $b = 0$.
What do you conclude? If you were arguing with Yule at a meeting of the 
Royal Statistical Society, would you want to take the position that b = 0 and 
he was fooled by chance variation? 


The t-Test. Part 2. 


In this part of the lab, you will do a simulation to investigate the distribution
of

$$t = \hat b/\mathrm{SE},$$

under the null hypothesis that $b = 0$.

1. Set the parameters in Yule’s equation (Lab 3) as follows: $a = -40$, $b = 0$,
$c = 0.2$, $d = -0.3$, $\sigma = 15$. Fix the design matrix X as in Part 1.

2. Generate 32 $N(0, \sigma^2)$ errors and plug them into the equation

$$\Delta\mathrm{Paup}_i = -40 + 0 \times \Delta\mathrm{Out}_i + 0.2 \times \Delta\mathrm{Old}_i - 0.3 \times \Delta\mathrm{Pop}_i + \epsilon_i,$$

to get simulated values for $\Delta\mathrm{Paup}_i$, with $i = 1, \ldots, 32$.

3. Regress the simulated ΔPaup on ΔOut, ΔPop, and ΔOld. Calculate $\hat b$,
SE, and $t$.

4. Repeat 2 and 3, 1000 times.

5. Plot a histogram for the 1000 $\hat b$’s, a scatter diagram for the 1000 pairs
$(\hat b, \hat\sigma)$, and a histogram for the 1000 $t$’s.

6. What is the theoretical distribution of $\hat b$? of $\hat\sigma^2$? of $t$? How close is the
theoretical distribution of $t$ to normal?

7. Calculate the mean and SD of the 1000 $\hat b$’s. How does the mean compare
to the true $b$? (“True” in the simulation.) How does the SD compare to
the true SE for $\hat b$?


You need to compute $(X'X)^{-1}$ only once, but $\hat\sigma^2$ many times. Your code will
run faster with more matrices and fewer loops. (As they say, vectorize your
code.) Try this:

beta=[-40 0 .2 -.3]'

sigma=15

betaSim=X\(X*beta*ones(1,1000)+sigma*randn(32,1000));
The backslash operator does the least squares fit. 
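
The vectorized idea can be carried one step further to get all 1000 t-statistics at once. The sketch below assumes a hypothetical design matrix generated by randn, because table 1.3 is not reproduced here; names such as Ysim, sigmaHat, SEb, and tSim are made up for the illustration.

```matlab
% Sketch for Part 2 of Lab 5: 1000 simulated t-statistics, no loops.
% Assumption: X is a hypothetical design matrix, not Yule's table 1.3.
n = 32;  nsim = 1000;
X = [ones(n,1) 10*randn(n,3)];
p = size(X,2);
beta  = [-40 0 .2 -.3]';
sigma = 15;
Ysim    = X*beta*ones(1,nsim) + sigma*randn(n,nsim);  % one data set per column
betaSim = X\Ysim;                                     % p x nsim estimates
resid   = Ysim - X*betaSim;                           % n x nsim residuals
sigmaHat = sqrt(sum(resid.^2)/(n-p));                 % 1 x nsim
XtXinv  = inv(X'*X);                                  % computed only once
SEb     = sigmaHat*sqrt(XtXinv(2,2));                 % SE for the second coefficient
tSim    = betaSim(2,:)./SEb;                          % 1 x nsim t-statistics
sizeEst = mean(abs(tSim) > 2);                        % rejection rate at k = 2
```

The second row of betaSim holds the 1000 estimates of b, because the second column of X is the one that b multiplies.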

For discussion. Would it matter if you set the parameters differently? For
instance, you could try $a = 10$, $b = 0$, $c = 0.1$, $d = -0.5$ and $\sigma = 25$.
What if $b = 0.5$? What if $\epsilon_i \sim \sigma \times (\chi^2_5 - 5)/\sqrt{10}$? The simulation in this lab
is for the size of the test. How would you do a simulation to get the power of
the test? (Size and power are defined below.)


A tangential issue. Plot a scatter diagram for the 1000 pairs $(\hat a, \hat b)$. What
accounts for the pattern?


Hypothesis Testing 


The discussion question in Part 2 of Lab 5 refers to size and power. To
review these ideas, and put them in context, let $\theta$ be a parameter (or parameter
vector). Write $P_\theta$ for the probability distribution of the random variables in
the model, when the parameter is $\theta$. The null hypothesis is a set of $\theta$’s; the
alternative is a disjoint set of $\theta$’s. Let $T$ be a test statistic. We reject the null
if $T > k$, where $k$ is the critical value, chosen so that $P_\theta(T > k) \le \alpha$ for all
$\theta$ in the null. Here $\alpha$ is the size or level of the test. Power is $P_\theta(T > k)$ for
$\theta$ in the alternative. This will depend, among other things, on $k$ and $\theta$.

From the Neyman-Pearson perspective, the ideal test maximizes power
(the chance of rejecting the null when the null is false) while controlling the
size, which is the chance of rejecting the null when the null is true.

With the t-test in Lab 5, the parameter vector $\theta$ is $a, b, c, d, \sigma^2$. The null
is the set of $\theta$’s with $b = 0$. The alternative is the set of $\theta$’s with $b \ne 0$. The
test statistic $T$ is, surprise, $|t|$. If you want $\alpha = 0.05$, choose $k = 2$. More
precisely, with normal errors, you want the $k$ such that the area beyond $\pm k$
under Student’s t-density with 28 degrees of freedom is equal to 0.05. The
answer to that riddle is 2.0484.... (See page 309.) For our purposes, the
extra precision isn’t worth the bother: $k = 2$ is just fine.

The observed significance level $P$ or $P_{\mathrm{obs}}$ is $P_\theta(T > T_{\mathrm{obs}})$, where $T_{\mathrm{obs}}$
is the observed value of the test statistic. If you think of $T_{\mathrm{obs}}$ as random
(i.e., before data collection), then $P_{\mathrm{obs}}$ is random. In Lab 5 and many similar
problems, if $\theta$ satisfies the null hypothesis, then $P_{\mathrm{obs}}$ is uniform on [0,1]: that
is, $P_\theta(P_{\mathrm{obs}} < p) = p$ for $0 < p < 1$. If the null is $b \le 0$ vs the alternative
$b > 0$, then $T = t$ rather than $|t|$, and $P_{\mathrm{obs}}$ is uniform when $b = 0$. If $b < 0$
then $P_\theta(P_{\mathrm{obs}} < p) < p$ for $0 < p < 1$. With a one-sided null hypothesis,
$P_{\mathrm{obs}}$ is generally computed assuming $b = 0$ (the worst-case scenario).
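
Size and power can both be estimated by the same simulation, run once with a parameter vector in the null and once with one in the alternative. The sketch below does this for the two-sided test with k = 2. It assumes a hypothetical design matrix and uses b = 0.5 as the alternative; both choices are only for illustration.

```matlab
% Sketch: estimating size and power by simulation.
% Assumption: hypothetical design; b = 0 gives size, b = 0.5 gives power.
n = 32;  nsim = 1000;  k = 2;
X = [ones(n,1) 10*randn(n,3)];
p = size(X,2);
XtXinv = inv(X'*X);
sigma = 15;
bvals = [0 0.5];                      % null value, then an alternative
rej = zeros(1,2);
for j = 1:2
    beta = [-40 bvals(j) .2 -.3]';
    Ysim = X*beta*ones(1,nsim) + sigma*randn(n,nsim);
    betaSim = X\Ysim;
    resid = Ysim - X*betaSim;
    SEb = sqrt(sum(resid.^2)/(n-p))*sqrt(XtXinv(2,2));
    rej(j) = mean(abs(betaSim(2,:)./SEb) > k);   % fraction of rejections
end
sizeEst  = rej(1);
powerEst = rej(2);
```

The loop runs only twice, so it costs nothing; the 1000 replications inside each pass are still vectorized.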


Lab 6 
The F-Test. Part 1. 


Yule’s model is explained in chapter 1, and in Lab 3. Fix the design 
matrix X at the values reported in table 1.3. (Subtract 100 from each entry to 
get the percent changes.) Assume that the errors $\epsilon_i$ are IID $N(0, \sigma^2)$. Test the
null hypothesis that $c = d = 0$. Use the F-test, as explained in section 5.7.

1. Fit the big model and the small model to Yule’s data by OLS and compute
the sums of squares that are needed for the test: $\|e\|^2$, $\|X\hat\beta\|^2$, and
$\|X\hat\beta^{(s)}\|^2$.

2. Calculate the F-statistic. What do you conclude?

3. Is $\|Y\|^2 = \|X\hat\beta^{(s)}\|^2 + (\|X\hat\beta\|^2 - \|X\hat\beta^{(s)}\|^2) + \|e\|^2$? Coincidence or
math fact?
MATLAB tip. X(:,1:2) picks off the first two columns in X. Colons are 
powerful.

The F-Test. Part 2.


In this part of the lab, you will use simulation to investigate the distribution
of the F-statistic for testing the null hypothesis that $c = d = 0$. You
should consider two ways to set the parameters:


(i) $a = 8$, $b = 0.8$, $c = 0$, $d = 0$, $\sigma = 15$

(ii) $a = 13$, $b = 0.8$, $c = 0.1$, $d = -0.3$, $\sigma = 10$

Fix the design matrix X as in Part 1. Simulate data from each set of parameters
to get the distribution of F.


For example, let’s look at (i). Generate 32 $\epsilon$’s and use the equation

$$\Delta\mathrm{Paup}_i = 8 + 0.8 \times \Delta\mathrm{Out}_i + \epsilon_i$$

to get simulated data on ΔPaup. Calculate F. Repeat 1000 times and make
a histogram for the values of F. You can take the $\epsilon$’s to be IID $N(0, 15^2)$.


Repeat for (ii). Which set of parameters satisfies the null hypothesis and
which satisfies the alternative hypothesis? Which simulation tells you about
size and which about power?


For discussion. Would it matter if you set the parameters in (i) differently?
For instance, you could try $a = 13$, $b = 1.8$, $c = 0$, $d = 0$ and $\sigma = 25$.
Would it matter if you set the parameters in (ii) differently? What if the errors
aren’t normally distributed?
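
For one data set, the sums of squares and the F-statistic in Lab 6 might be computed along the following lines. The sketch assumes the usual form of the F-statistic for nested models and a hypothetical design matrix; Xs, fitBig, fitSmall, and gap are names invented here.

```matlab
% Sketch for Lab 6: the F-statistic for c = d = 0.
% Assumption: hypothetical design and response, not Yule's data.
n = 32;
X  = [ones(n,1) 10*randn(n,3)];      % big model: 4 columns
Xs = X(:,1:2);                       % small model: first two columns
Y  = X*[8 0.8 0 0]' + 15*randn(n,1); % generated with c = d = 0
p  = size(X,2);
p0 = p - size(Xs,2);                 % number of constraints
fitBig   = X*(X\Y);
fitSmall = Xs*(Xs\Y);
e = Y - fitBig;                      % residuals from the big model
ssE     = sum(e.^2);                 % ||e||^2
ssBig   = sum(fitBig.^2);            % ||X betaHat||^2
ssSmall = sum(fitSmall.^2);          % ||X betaHat^(s)||^2
F = ((ssBig - ssSmall)/p0)/(ssE/(n-p));
gap = sum(Y.^2) - (ssSmall + (ssBig - ssSmall) + ssE);
```

For Part 2, the same lines can be wrapped in a loop, or vectorized with one simulated data set per column as in Lab 5.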


Vectorizing Code 


These days, computers are very, very fast. It may not pay to spend a lot of 
time writing tight code. On the other hand, if you are doing a big simulation, 
and it is running like molasses, getting rid of loops is good advice. If you 
have nested loops, make the innermost loop as efficient as you can. 


Lab 7 
Collinearity 


In this lab, you will use simulation to examine the effect of collinearity. 
To get started, you might think about r = 0.3 where collinearity is mild, 
and r = 0.99 where collinearity is severe. If you feel ambitious, also try 
$r = -0.3$ and $r = -0.99$.


1. Simulate 100 IID picks $(\xi_i, \zeta_i)$ from a bivariate normal distribution,
where $E(\xi_i) = E(\zeta_i) = 0$, $E(\xi_i^2) = E(\zeta_i^2) = 1$, and $E(\xi_i\zeta_i) = 0$. Use
randn(100,2).

2. As data, these columns won’t quite have mean 0, variance 1, or correlation 0.
(Why not?) Cleaning up takes a bit of work.

(a) Standardize $\xi$ to have mean 0 and variance 1: call the result $U$.

(b) Regress $\zeta$ on $U$. No intercept is needed. (Use the backslash operator
\ to do the regression.) Let $e$ be the vector of residuals. So
$e \perp U$. Standardize $e$ to have mean 0 and variance 1. Call the
result $W$.

(c) Let $r$ be the correlation you want. Set $V = rU + \sqrt{1 - r^2}\,W$.

(d) Check that $U$ and $V$ have mean 0, variance 1, and correlation $r$,
exactly. (Exactly? or up to roundoff error?)


3. Simulate $Y_i = U_i + V_i + \epsilon_i$ for $i = 1, \ldots, 100$, where the $\epsilon_i$ are IID
$N(0, 1)$. Use randn(100,1) to get the $\epsilon$’s.


4. Fit the no-intercept regression equation

$$Y_i = \hat a U_i + \hat b V_i + \text{residual}$$

to your simulated data set.

5. Repeat 1000 times, keeping U and V fixed.

6. Plot histograms for $\hat a$, $\hat b$, $\hat a + \hat b$, and $\hat a - \hat b$.


7. There are four parameters of interest: a, b, a + b, a — b. What are their 
true values? Which parameter is easiest to estimate? Hardest? Discuss 
briefly. 


For bonus points. Why don’t you need an intercept in 2(b)? in 3? Does 
it matter whether you regenerate (U, V) in step 4, rather than keeping them 
fixed? 


MATLAB tip. std (x) divides by n — 1, but std (x, 1) divides by n. You 
can work the lab either way: just be consistent. 
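
Step 2 of Lab 7 translates into a few lines of MATLAB. The sketch below uses std(x,1) throughout, which is one of the two conventions mentioned in the tip; the names xi, zeta, and check are chosen for this illustration.

```matlab
% Sketch for step 2 of Lab 7: columns with sample correlation r.
n = 100;  r = 0.99;
Z = randn(n,2);
xi = Z(:,1);  zeta = Z(:,2);
U = (xi - mean(xi))/std(xi,1);          % mean 0, variance 1 (divide by n)
e = zeta - U*(U\zeta);                  % residuals from regressing zeta on U
W = (e - mean(e))/std(e,1);             % standardized residuals
V = r*U + sqrt(1-r^2)*W;
C = corrcoef([U V]);                    % 2 x 2 correlation matrix
check = [mean(U) mean(V) std(U,1) std(V,1) C(1,2)];
disp(check)
```

Changing r on the second line is all it takes to move from mild to severe collinearity.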


Lab 8 
Path Diagrams 


In this lab, you will replicate part of Blau and Duncan’s path model 
in figure 6.1. Equation (6.3) explains son’s occupation in terms of father’s 
occupation, son’s education, and son’s first job. Variables are standardized. 
Correlations are given in table 6.1.

1. Estimate the path coefficients in (6.3) and the standard deviation of the
error term. How do your results compare with those in figure 6.1?


2. Compute SEs for the estimated path coefficients. (Assume there are 
20,000 subjects.) 


Lab 9 
More Path Diagrams 


In this lab, you will replicate Gibson’s path diagram, which explains 
repression in terms of mass and elite tolerance (section 6.3). The correla- 
tion between mass and elite tolerance scores is 0.52; between mass tolerance 
scores and repression scores, —0.26; between elite tolerance scores and re- 
pression scores, —0.42. (Tolerance scores were averaged within state.) 


1. Compute the path coefficients in figure 6.2, using the method of sec- 
tion 6.1. 

2. Estimate $\sigma^2$. Gibson had repression scores for all the states. He had
mass tolerance scores for 36 states and elite tolerance scores for 26
states. You may assume the correlations are based on 36 states (this
will understate the SEs, by a bit) but you need to decide if $p$ is 2 or 3.

3. Compute SEs for the estimates.

4. Compute the SE for the difference of the two path coefficients. You will
need the off-diagonal element of the covariance matrix: see exercise
4B14(a). Comment on the result.


Note. Gibson used weighted regression, this lab does not use weights (but 
see http://www.stat.berkeley.edu/users/census/repgibson.pdf). 
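
With standardized variables, regression coefficients can be computed from correlations alone. The sketch below applies that idea to the three correlations quoted in Lab 9, on the assumption that this is what the method of section 6.1 amounts to; the names rXX, rXY, pathCoef, and varResid are invented here, and no weights are used.

```matlab
% Sketch: standardized regression from a correlation matrix.
% The correlations are the ones quoted in Lab 9; no weights are used.
rXX = [1    0.52
       0.52 1   ];            % mass and elite tolerance
rXY = [-0.26; -0.42];         % each of them with repression
pathCoef = rXX\rXY;           % solves rXX*pathCoef = rXY
varResid = 1 - rXY'*pathCoef; % residual variance, before any
                              % degrees-of-freedom adjustment
disp(pathCoef')
```

The degrees-of-freedom adjustment, which depends on the choice of p in step 2, is left out on purpose.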


Lab 10 
Maximum Likelihood 


In this lab, you will compute the MLE by numerical maximization of 
the log likelihood. Suppose that $X_i$ are IID for $i = 1, 2, \ldots, 50$. Their
common density function is $\theta/(\theta + x)^2$ for $0 < x < \infty$. The parameter $\theta$
is an unknown positive constant. See example 4 in section 7.1. Data on the
$X_i$’s are in the file mle.dat. This is a complicated lab, which might take
two weeks to do.


1. Write down the formula for the log likelihood; plot it as a function of $\theta$.


2. Find the MLE $\hat\theta$ by numerical maximization. It will be better to use
the parameter $\phi = \log\theta$. If $\phi$ is real, $\theta = e^\phi > 0$, so the positivity
constraint on $\theta$ is satisfied, and no constraint needs to be imposed on $\phi$.

3. Put a standard error on $\hat\theta$. (See theorem 7.1, and exercise 7A8.)

Some useful MATLAB commands: fminsearch, log, exp.


fminsearch does minimization. (Minimizing $-f$ is the same as maximizing
$f$, although it’s a little more confusing.) Use the syntax


phiwant=fminsearch(@negloglike, start, [ ], x)


Here, phiwant is what you want: the parameter value that minimizes the
negative log likelihood. The MLE for $\theta$ is exp(phiwant). The at-sign
@ is MATLAB’s way of referring to functions. fminsearch looks for a
local minimum of negloglike near the starting point, start. The
log median of the data is a good choice for start. This particular negative
likelihood function has a unique minimum (exercise 7A6). The rationale for
the log median is exercise 7A7, plus the fact that $\phi = \log\theta$. Starting at the
median or log median of the data is not a general recipe. Some versions of
MATLAB may balk at @: if so, try

phiwant=fminsearch('negloglike',...)

You have to write negloglike.m. This is a function file that computes the
negative log likelihood from phi and x, where phi is the parameter $\log\theta$
and x is the data, which you get from mle.dat. The call to fminsearch
passes the data x to negloglike.m. It does not pass the parameter phi
to negloglike.m: MATLAB will minimize over phi. The first line of
negloglike.m should be


function negll=negloglike(phi,x)


The rest of the file is MATLAB code that computes negll, the negative
log likelihood, from phi and x. At the end of negloglike.m, you need
a line of code that sets negll to the value that you have computed from
phi and x.


Just to illustrate syntax, here is a function file that computes $(u + \cos u)^2$
from $u$.

function youpluscosyoutoo=fun(u)
youpluscosyoutoo=(u+cos(u))^2;


You would save these two lines of code as fun.m. If at the MATLAB
prompt (or in some other m-file) you type fun(3), MATLAB will return
$(3 + \cos 3)^2 = 4.0401$. If you type


fminsearch(@fun, 1)


MATLAB will return $-0.7391$, the $u$ that minimizes $(u + \cos u)^2$. The search
started at 1.
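
The pieces of Lab 10 can be rehearsed on simulated data before turning to mle.dat. The sketch below makes two assumptions that are not in the lab: it generates its own data with $\theta = 25$ (using the recipe from Lab 11), and it uses an anonymous function in place of the function file negloglike.m so that everything fits in one script.

```matlab
% Sketch for Lab 10 on simulated data (mle.dat is not used here).
% Assumption: data generated as in Lab 11, with theta = 25.
theta = 25;  n = 50;
u = rand(n,1);
x = theta*u./(1-u);                          % sample from theta/(theta+x)^2
% negative log likelihood as a function of phi = log(theta)
negloglike = @(phi, x) -(numel(x)*phi - 2*sum(log(exp(phi) + x)));
start = log(median(x));                      % log median as starting point
phiwant = fminsearch(@(phi) negloglike(phi, x), start);
thetaHat = exp(phiwant);                     % back to the theta scale
% derivative of the log likelihood in phi, at the answer
score = n - 2*sum(exp(phiwant)./(exp(phiwant) + x));
disp([thetaHat score])
```

A score near 0 is a useful check that fminsearch stopped at a stationary point of the log likelihood.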

Lab 11
Simulations for the MLE


In this lab, you will investigate the distribution of $\hat\theta$, the maximum
likelihood estimate of $\theta$, for the model in Lab 10. You should be able to reuse
most of your code. This might be an occasion for loops.


1. Generate 50 IID variables $U_i$ that are uniform on [0, 1]. Set $\theta = 25$ and
$X_i = \theta U_i/(1 - U_i)$. According to exercise 7A5, you now have a sample
of size 50 from the density $\theta/(\theta + x)^2$.

2. Find the MLE $\hat\theta$ by numerical maximization.

3. Repeat 1000 times.

4. Plot a histogram for the 1000 realizations of $\hat\theta$.

5. Calculate the mean and SD of the 1000 realizations of $\hat\theta$. How does the
SD compare to $1/\sqrt{50 \cdot I_\theta}$? (The Fisher information $I_\theta$ is computed in
exercise 7A8.) Comment?


For bonus points. Let $t = (\hat\theta - 25)/\mathrm{SE}$, where SE is computed either
from the Fisher information as in point 5, or from observed information.
Which version of $t$ is more like a normal distribution?


Double or quits on bonus points. What happens to $\hat\theta$ if you double $\theta$,
from 25 to 50? What about Fisher information? observed information? 
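
Lab 11 is a natural place for a loop, since each replication needs its own call to fminsearch. The sketch below assumes 200 replicates to keep the run short (the lab asks for 1000) and again uses an anonymous function as a stand-in for negloglike.m.

```matlab
% Sketch for Lab 11: distribution of the MLE by simulation.
% Assumption: 200 replicates for speed; the lab asks for 1000.
theta = 25;  n = 50;  nrep = 200;
negloglike = @(phi, x) -(numel(x)*phi - 2*sum(log(exp(phi) + x)));
thetaHat = zeros(nrep,1);                % storage for the estimates
for k = 1:nrep
    u = rand(n,1);
    x = theta*u./(1-u);                  % fresh sample of size 50
    phiHat = fminsearch(@(phi) negloglike(phi, x), log(median(x)));
    thetaHat(k) = exp(phiHat);
end
aveHat = mean(thetaHat);
sdHat  = std(thetaHat);
disp([aveHat sdHat])
% hist(thetaHat,25) would draw the histogram
```

Allocating thetaHat before the loop avoids growing the vector one element at a time.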


Lab 12 
The Logit Model 


In this lab, you will fit a logit model, using data from the 2001 Current
Population Survey. The data are in pac01.dat. The data cover 13,803
individuals 16 years of age or older, in the five Pacific states of the US.
The variables and file layout are explained in pac01doc.txt in the same
directory.


The dependent variable Y is 1 if the person is employed and at work (LABSTAT
is 1). Otherwise, Y = 0. The explanatory variables are age, sex, race, and 
educational level. The following categories should be used: 


Age: 16-19, 20-39, 40-64, 65 or above. 
Sex: male, female. (Not much choice about this one.) 
Race: white, non-white. 


Educational level: not a high school graduate, a high school education 
but no more, more than a high school education.

For the baseline individual in the model, choose a person who is male,
non-white, age 16-19, and did not graduate from high school.

1. What is the size of the design matrix?

2. Use fminsearch to fit the model; report the parameter estimates.

3. Estimate the SEs; use observed information.

4. What does the model say about employment?

5. Why use dummy variables for education, rather than EDLEVEL as a
quantitative variable?


6. For discussion. Why might women be less likely to have LABSTAT = 
1? Are LABSTAT codes over 4 relevant to this issue? 


Where should fminsearch start looking? Read section 7.2! How to com- 
pute the log likelihood function and its derivatives? Work exercises 7D9-10. 


MATLAB tip. If U and V are m x n matrices, then U<V is an m x n matrix of
0’s and 1’s: there is a 1 in position (i,j) provided U(i,j)<V(i,j).


Numerical Maximization 


Numerical maximization can be tricky. The more parameters there are,
the trickier it gets. As a partial check on the algorithm, you can start the
maximization from several different places. Another useful idea: if the
computer tells you the max is at [1.4517 0.5334 0.8515 ...], start the
search again from a nearby point, like [1.5 0.5 0.8 ...].
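
The restart check can be seen in action on a small logit model. The sketch below uses simulated data with hypothetical variables age and female, not the CPS file, and it starts the first search at 0, which is a choice made for this illustration and not necessarily the advice of section 7.2. The SEs come from the usual observed information matrix for the logit model.

```matlab
% Sketch: fitting a logit model with fminsearch on simulated data.
% Assumption: hypothetical variables age and female, not the CPS file.
n = 2000;
age    = 16 + floor(60*rand(n,1));
female = double(rand(n,1) < 0.5);
mid    = double(age >= 20 & age < 65);           % dummy from comparisons
X = [ones(n,1) mid female];                      % n x 3 design matrix
betaTrue = [-0.5; 1.5; -0.4];                    % arbitrary "true" values
Y = double(rand(n,1) < 1./(1 + exp(-X*betaTrue)));
negll = @(b) -sum(Y.*(X*b) - log(1 + exp(X*b))); % negative log likelihood
b1 = fminsearch(negll, zeros(3,1));              % first search
b2 = fminsearch(negll, round(10*b1)/10);         % restart from a nearby point
pHat = 1./(1 + exp(-X*b2));
info = X'*(X.*((pHat.*(1-pHat))*ones(1,3)));     % observed information
SE = sqrt(diag(inv(info)));
disp([b1 b2 SE])
```

If b1 and b2 disagree by more than a little, the search has probably not converged, and it is worth trying other starting points.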
Lab 13 


Simultaneous Equations 


In this lab, you will fit a model that has two simultaneous equations. 
The model is the one proposed by Rindfuss et al for determining a woman’s 
educational level (ED) and age at first birth (AGE). The model is described 
in section 9.5; variables are defined in table 9.1. The correlation matrix is
shown below. Also see


rindcor.dat


In rindcor.dat, the upper right triangle is filled with 0’s: that way, MATLAB
can read the file. You may need to do something about all those 0’s.


        OCC    RACE   NOSIB  FARM   REGN   ADOLF  REL    YCIG   FEC    ED     AGE
OCC     1.000
RACE    -.144  1.000
NOSIB   -.244   .156  1.000
FARM    -.323   .088   .274  1.000
REGN    -.129   .315   .150   .218  1.000
ADOLF   -.056   .150  -.039  -.030   .071  1.000
REL      .053  -.152   .014  -.149  -.292   .052  1.000
YCIG    -.043   .030   .028  -.060  -.011   .067  -.010  1.000
FEC      .037   .035   .002  -.032  -.027   .018  -.002   .009  1.000
ED       .370  -.222  -.328  -.185  -.211   .157  -.012  -.171   .038  1.000
AGE      .186  -.189  -.115  -.118  -.177   .111   .098  -.122   .216   .380  1.000


Your mission, if you choose to accept it, is to estimate parameters in the
standardized equations that explain ED and AGE. Variables are standardized
to mean 0 and variance 1, so equations do not need intercepts. You do not have
the original data, but can still use IVLS (section 9.2) or IISLS (section 9.4).
IVLS might be easier. You have to translate equation (9.10) into usable form.
For example, $Z'X/n$ becomes the $q \times p$ matrix of correlations between the
instruments and the explanatory variables. See section 6.1.


Keeping track of indices is irritating. Here is a useful MATLAB trick. Number
the variables from 1 through 11: OCC is #1, ..., AGE is #11. Let X
consist, e.g., of variables 11, 2 through 8, and 1 (i.e., AGE, RACE, ...,
YCIG, OCC). How do you get the correlation matrix M for X from the correlation
matrix C for all the variables in the system? Nothing is easier. You get
C by loading rindcor.dat and filling in the upper triangular part. Then
you type

idx=[11 2:8 1];

M=C(idx',idx);

(Here, idx is just a name: ID numbers of variables in X.) Let Z consist,
e.g., of variables 9, 2 through 8, and 1 (i.e., FEC, RACE, ..., YCIG, OCC).
How do you get the matrix L of correlations between Z and X? You define
idz (that’s part of your job), then type L=C(idz',idx). There is a row
of L for each variable in Z, and a column for each variable in X.


Any comment on the coefficients of the control variables (OCC, ..., FEC)?


For bonus points


1. Rindfuss et al is reprinted at the back of the book. If your results differ
from those in the paper (table 2), why might that be?

2. Find the asymptotic SEs. Hint: in equation (9.14),

$$\|Y - X\hat\beta_{\mathrm{IVLS}}\|^2 = \|Y\|^2 + \hat\beta_{\mathrm{IVLS}}'(X'X)\hat\beta_{\mathrm{IVLS}} - 2(Y'X)\hat\beta_{\mathrm{IVLS}}.$$
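
The index trick in Lab 13 is easier to follow on a small example. The sketch below uses a hypothetical system of 4 variables with made-up correlations, not the Rindfuss matrix. It assumes that equation (9.10) is the usual IVLS formula, with $Z'Z/n$, $Z'X/n$, and $Z'Y/n$ replaced by the corresponding blocks of correlations.

```matlab
% Sketch of the index trick with a hypothetical 4-variable system.
% Assumption: variable 1 is an instrument, 2 is exogenous,
% 3 is endogenous, 4 is the response. The numbers are made up.
Clow = [1    0    0    0
        0.3  1    0    0
        0.5  0.2  1    0
        0.3  0.4  0.6  1];            % lower triangle, as in the file
C = Clow + tril(Clow,-1)';            % fill in the upper triangle
idx = [3 2];                          % explanatory variables X
idz = [1 2];                          % instruments Z
idy = 4;                              % response
ZZ = C(idz,idz);                      % plays the role of Z'Z/n
ZX = C(idz,idx);                      % plays the role of Z'X/n
ZY = C(idz,idy);                      % plays the role of Z'Y/n
bIVLS = (ZX'*(ZZ\ZX))\(ZX'*(ZZ\ZY));  % IVLS from correlations
disp(bIVLS')
```

With as many instruments as explanatory variables, as in this example, the formula collapses to ZX\ZY, which makes a convenient check.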


Lab 14 and Beyond 
Additional Topics 


Additional labs can be based on the data-snooping simulation (section 
5.8, end notes to chapter 5); on discussion questions 10-15 in chapter 4, 6 in 
chapter 5, or 3, 6, 15 in chapter 6; on bootstrap example 4 in chapter 8; and on 
the IVLS simulations in section 9.8. Replicating table 8.1 is also a worthwhile 
activity. However, it takes a fair amount of coding effort to replicate column F, 
and the resulting code may run very, very slowly: code should be debugged 
using a small number of bootstrap replicates. An interesting supplementary 
question: which is a better estimator for $\beta$, OLS or one-step GLS? In graduate
courses, there is a useful supplement to Lab 5. 


The t-Test. Part 3. 


Suppose the true model behind Yule’s data is 
$$\Delta\mathrm{Paup}_i = a + b_1 \times \Delta\mathrm{Out}_i + b_2 \times (\Delta\mathrm{Out}_i)^2 + c \times \Delta\mathrm{Pop}_i + d \times \Delta\mathrm{Old}_i + \epsilon_i, \qquad (\dagger)$$


where the $\epsilon_i$ are IID $N(0, \sigma^2)$ with $\sigma = 10$. However, Yule fits the linear
model with $b_2$ constrained to 0, that is, he assumes


$$\Delta\mathrm{Paup}_i = a + b \times \Delta\mathrm{Out}_i + c \times \Delta\mathrm{Pop}_i + d \times \Delta\mathrm{Old}_i + \epsilon_i. \qquad (\ddagger)$$


How big would $|b_2|$ have to be to find the mistake, by looking at the residual
plot for the fit to $(\ddagger)$?

Try to answer this question for the special case where all other parameters
are fixed: $a = 13$, $b_1 = 0.8$, $c = -0.3$, $d = 0.1$. Choose a value for $b_2$
and generate 32 errors $\epsilon_i$ from the normal distribution. Use this model to
construct simulated data on ΔPaup. Now regress the simulated ΔPaup on
ΔOut, ΔPop, ΔOld and plot the residuals. You will need to make several
plots for each trial value of $b_2$. (Why?)


1. Do standard errors take specification error into account?


2. Do the standard errors for the coefficients in Yule’s model $(\ddagger)$ measure
the uncertainty in predicting the results of intervention?


3. What are the implications?

4. If you knew that the only plausible alternative to $(\ddagger)$ was $(\dagger)$, how would
you decide between the two specifications?


Terminology. A “specification” says what variables go into a model, 
what the functional form is, and what should be assumed about the disturbance 
term (or latent variable); if the data are generated some other way, that is 
“specification error” or “misspecification.” 


Statistical Packages 


The labs are organized to help you learn what’s going on underneath the 
hood when you fit a model. Statistical packages are organized to help you fit 
standard models with a minimum of fuss—although software designers have 
their own ideas about what “fuss” should mean to the rest of us. Recom- 
mended packages include the MATLAB statistics toolbox, R, and SAS. For 
instance, in release 13 of the MATLAB toolbox, you can fit a probit model 
by the command 


glmfit(X, [Y ones(n,1)], 'binomial', 'probit')


Here, X is the design matrix, and Y is the response variable. MATLAB thinks
of [Y ones(n,1)] as describing n binomial variables, each with 0 or 1
success out of 1 trial. The first column in [Y ones(n,1)] tells it the
number of successes, and the second column tells it the number of trials.
There is a quirk in the code: you don’t put a column of 1’s into X. MATLAB
will do this for you, and two columns of 1’s is one too many. In version 1.9.0
of R,


glm(Y~X1+X2, family=binomial(link=probit))


will fit a probit model. The response variable is Y, as above. There are two
independent variables, X1 and X2. Again, an intercept is supplied for you.
The formula with the tilde, Y~X1+X2, is just R’s way of describing a model
to itself: the dependent variable is Y; and there are two explanatory variables,
X1 and X2. The family=binomial(link=probit) tells it you have
a binomial response variable and want to fit a probit model. (Before you 
actually do this, please read An Introduction to R—click on Help in the R 
console, then on Manuals.) 

What about statistical tables? The MATLAB statistics toolbox has “cdf” 
and “icdf” functions that replace printed tables for the normal, t, F, and a 
dozen other classical distributions. In R, check the section called “R as a set 
of statistical tables” in An Introduction to R. Looking things up in printed 
statistical tables is now like using a slide rule to multiply numbers. 
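
For the normal distribution, the lookups can be done even without the toolbox. The sketch below uses erfc and erfcinv from base MATLAB; the helper names normal_cdf and normal_icdf are hypothetical, chosen for this illustration.

```matlab
% Standard normal distribution function and its inverse, base MATLAB only.
normal_cdf  = @(z) 0.5*erfc(-z/sqrt(2));
normal_icdf = @(p) -sqrt(2)*erfcinv(2*p);

area_below_196 = normal_cdf(1.96)      % about 0.975
upper_cutoff   = normal_icdf(0.975)    % about 1.96
```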


Appendix: Sample MATLAB Code 


This program has most of the features you will need during the semester. 


It loads a data file small.dat listed at the end. It calls a function file
phi.m also listed at the end.

A script file—demolab.m

% demolab.m
% a line that starts with a percent sign
% is a comment
% at the UNIX prompt, type matlab...
% you will get the matlab prompt, >>
% you can type edit to get an editor
% help to get help
% helpdesk for a browser-based help facility

% emergency stop is .... control-c

% how to create matrices
disp('CR means carriage-return-- the "enter" key')
qq=input('hit cr to see some matrix arithmetic');


% this is a way for the program to get input,
% here it just waits until you press the enter key,
% so you can look at the screen....

% names can be pretty long and complicated

twice_x=2*x
x_plus_y=x+y

transpose_x=x'
transpose_x_times_y=x'*y

qq=input('hit cr to see determinants and inverses');

determinant_of_xTy=det(x'*y)
inverse_of_xTy=inv(x'*y)

disp('hit cr to see coordinatewise multiplication,')
qq=input('division, powers.... ');

x_dotstar_y=x.*y
x_over_y=x./y
x_squared=x.^2

qq=input('hit cr for utility matrices ');
ZZZ=zeros(2,5)
WON=ones(2,3)
ident=eye(3)

disp('hit cr to put matrices together--')
qq=input('concatenation-- use [ ] ');

concatenated=[ones(3,1) x y]

qq=input('hit cr to graph log(t) against t ... ');
t=[.01:.05:10]';
% start at .01, go to 10 in steps of .05


plot(t,log(t),'x')
disp('look at the graph!!!')
disp(' ')
disp(' ')

disp('loops')
disp('if ... then... ')
disp('MATLAB uses == to test for equality')
disp('MATLAB will print the perfect squares')
disp('from 1 to 50')
qq=input('hit cr to go .... ');

for j=1:50 % sets up a loop
    if j==fix(sqrt(j))^2
        found_a_perfect_square=j
        % fix gets rid of decimals,
        % fix(2.4)=2, fix(-2.4)=-2
    end % gotta end the "if"
end % end the loop

% spaces and indenting make the code easier to read
qq=input('hit cr to load a file and get summaries');
load small.dat


ave_cols_12=mean(small(:,1:2))
SD_cols_12=std(small(:,1:2))

% small(:,1) is the first column of small...
% that is what the colon does
% small(:,1:2) is the first two columns
% matlab divides by n-1 when computing the SD

u=small(:,3);
v=small(:,4);
% the semicolon means, don’t print the result


qq=input('hit cr for a scatterplot... ');
plot(u,v,'x')

correlation_matrix_34=corrcoef(u,v)
% look at top right of the matrix
% for the correlation coefficient
disp('hit cr to get correlations')
qq=input('between all pairs of columns ');

all_corrs=corrcoef(small)
qq=input('hit cr for simulations ');

uniform_random_numbers=rand(3,2)
normal_random_numbers=randn(2,4)

disp('so, what is E(cos(Z)|Z>0) when Z is N(0,1)?')
qq=input('hit cr to find out ');
Z=randn(10000,1);
f=find(Z>0);
EcosZ_given_Z_is_positive=mean(cos(Z(f)))
trickier=mean(cos(Z(Z>0)))

disp('come let us replicate,')
qq=input('might be sampling error, hit cr ');
Z=randn(10000,1);
f=find(Z>0);
first_shot_was=EcosZ_given_Z_is_positive
replicate=mean(cos(Z(f)))

disp('guess there is sampling error....')
disp(' ')
disp(' ')
disp(' ')
disp('MATLAB has script files and function files ')
disp('mean and std are function files,')
disp('mean.m and std.m ')
disp('there is a function file phi.m')
disp('that computes the normal curve')

qq=input('hit cr to see the graph ');
u=[-4:.05:4];
plot(u,phi(u))


A function file—phi.m 


% phi.m 

% save this in a file called phi.m 

% first line of code has to look like this... 
function y=phi(x)

y=(1/sqrt(2*pi))*exp(-.5*x.^2);
% at the end, you have to compute y-- 
% see first line of code 


small.dat 

1 2 2 4 

4 1 3 8.5 
2 2 5 1 

8 9 FT ibe OS 
3 3 4 2 

7 7 0.5 3 
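
The simulation near the end of demolab.m has an exact answer to compare with. Since cos is an even function and Z is symmetric about 0, E(cos(Z)|Z>0)=E(cos(Z)), which equals exp(-1/2), or about 0.607. The sketch below compares a simulated value with the exact one and attaches a standard error, which puts a size on the sampling error mentioned in the script. The variable names are illustrative and are not part of demolab.m.

```matlab
Z = randn(100000,1);
c = cos(Z(Z>0));                    % keep only the draws with Z>0
simulated = mean(c)
exact = exp(-0.5)
SE = std(c)/sqrt(length(c))         % standard error of the simulated average
off_by_SEs = (simulated-exact)/SE   % usually between -2 and 2
```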


Political Intolerance and Political Repression During the 
McCarthy Red Scare 


James L. Gibson, University of Houston 


Abstract 


I test several hypotheses concerning the origins of political repression in the 
states of the United States. The hypotheses are drawn from the elitist theory 
of democracy, which asserts that repression of unpopular political minorities 
stems from the intolerance of the mass public, the generally more tolerant 
elites not supporting such repression. Focusing on the repressive legislation 
adopted by the states during the McCarthy era, I examine the relationships 
between elite and mass opinion and repressive public policy. Generally it 
seems that elites, not masses, were responsible for the repression of the era. 
These findings suggest that the elitist theory of democracy is in need of substantial
theoretical reconsideration, as well as further empirical investigation.

Over three decades of research on citizen willingness to “put up with” political
differences has led to the conclusion that the U.S. public is remarkably
intolerant. Though the particular political minority that is salient enough to 
attract the wrath of the public may oscillate over time between the Left and 
the Right (e.g., Sullivan, Piereson, and Marcus 1982), generally, to be much 
outside the centrist mainstream of U.S. politics is to incur a considerable risk 
of being the object of mass political intolerance. 

At the same time, however, U.S. public policy is commonly regarded as 
being relatively tolerant of political minorities. Most citizens believe that all 
citizens are offered tremendous opportunities for the expression of their polit- 
ical preferences (e.g., McClosky and Brill 1983, 78). The First Amendment 
to the U.S. Constitution is commonly regarded as one of the most uncompro- 
mising assertions of the right to freedom of speech to be found in the world 
(“Congress shall make no law ...”). Policy, if not public opinion, appears to 
protect and encourage political diversity and competition. 

The seeming inconsistency between opinion and policy has not gone un- 
noticed by scholars. Some argue that the masses are not nearly so intolerant 
as they seem, in part due to biases in the questions used to measure intoler- 
ance (e.g., Femia 1975) and in part because the greater educational opportu- 
nity of the last few decades has created more widespread acceptance of polit- 
ical diversity (e.g., Davis 1975; Nunn, Crockett, and Williams 1978). Most, 
however, are willing to accept at face value the relative intolerance of the 
mass public and the relative tolerance of public policy but to seek reconcil- 
iation of the seeming contradiction by turning to the processes linking opin- 
ion to policy. Public policy is tolerant in the United States because the pro- 
cesses through which citizen preferences are linked to government action do 
not faithfully translate intolerant opinion inputs into repressive policy outputs. 
Just as in so many other substantive policy areas, public policy concerning the 
rights of political minorities fails to reflect the intolerant attitudes of the mass 
public. 

Instead, the elitist theory of democracy asserts, policy is protective of po- 
litical minorities because it reflects the preferences of elites, preferences that 
tend to be more tolerant than those of the mass public. For a variety of rea- 
sons, those who exert influence over the policymaking process in the United 
States are more willing to restrain the coercive power of the state in its deal- 
ings with political opposition groups. Thus there is a linkage between policy 
and opinion, but it is to tolerant elite opinion, not to intolerant mass opinion. 
Mass opinion is ordinarily not of great significance; public policy reflects elite 
opinion and is consequently tolerant of political diversity. The democratic
character of the regime is enhanced through the political apathy and immobility
of the masses, according to the elitist theory of democracy.¹

The elitist theory nonetheless asserts that outbreaks of political 
repression—when they occur—are attributable to the mass public. While the 
preferences of ordinary citizens typically have little influence over public 
policy—in part, perhaps, because citizens have no real preferences on most 
civil liberties issues—there are instances in which the intolerance of the mass 
public becomes mobilized. Under conditions of perceived threat to the status 
quo, for example, members of the mass public may become politically ac- 
tive. In the context of the general propensity toward intolerance among the 
mass public, mobilization typically results in demands for political repres- 
sion. Thus, the elitist theory of democracy hypothesizes that political repres- 
sion flows from demands from an activated mass public. 

The theory of “pluralistic intolerance”—recently proposed by Sullivan,
Piereson, and Marcus (1979, 1982) and Krouse and Marcus (1984)—provides 
a nice explanation of the process through which mass intolerance is mobi- 
lized (see also Sullivan et al. 1985). The theory asserts that one of the primary 
causes of political repression is the focusing of mass intolerance on a specific 
unpopular political minority. To the extent that intolerance becomes focused, 
it is capable of being mobilized. Mobilization results in demands for politi- 
cal repression, demands to which policy makers accede. The authors claim 
support for their theory from recent U.S. history: 


“During the 1950s, the United States was undoubtedly a society characterized by 
considerable consensus in target group selection. The Communist Party and its 
suspected sympathizers were subjected to significant repression, and there seemed 
to be a great deal of support for such actions among large segments of the political 
leadership as well as the mass public. . . . The political fragmentation and the pro- 
liferation of extremist groups in American politics since the 1950s has undoubt- 
edly resulted in a greater degree of diversity in target group selection. If this is the 
case, such a situation is less likely to result in repressive action, even if the mass 
public is roughly as intolerant as individuals as they were in the 1950s (Sullivan, 
Piereson, and Marcus 1982, 85, emphasis in original).” 


Thus both the elitist theory of democracy and the theory of pluralistic intol- 
erance are founded upon assumptions about the linkage between opinion and 
policy. 

Despite the wide acceptance of the elitist theory of democracy, there has 
been very little empirical investigation of this critical linkage between opinion 
and policy.² Consequently, this research is designed as an empirical test of the
policy implications of the widespread intolerance that seems to characterize
the political culture of the United States. Using data on elite and mass opinion 
and on public policy in the states, the linkage hypothesis is tested. My focus is 
on the era of the McCarthy Red Scare, due to its political and theoretical im- 
portance. Thus I assess whether there are any significant policy implications 
that flow from elite and mass intolerance. 


Public Policy Repression 
Conceptualization 


A major impediment to drawing conclusions about the linkage between po- 
litical intolerance and the degree of repression in U.S. public policy is that 
rigorous conceptualizations and reproducible operationalizations of policy re- 
pression do not exist. Conceptually, I define repressive public policy as statu- 
tory restriction on oppositionist political activity (by which I mean activities 
through which citizens, individually or in groups, compete for political power 
[cf. Dahl 1971]) upon some, but not all, competitors for political power.³ For
example, policy outlawing a political party would be considered repressive, 
just as would policy that requires the members of some political parties to 
register with the government while not placing similar requirements on mem- 
bers of other political parties. Though there are some significant limitations 
to this definition, there is utility to considering the absence of political re- 
pression (political freedom) as including unimpaired opportunities for all full 
citizens 


1. to formulate their preferences 

2. to signify their preferences to their fellow citizens and the government 
by individual and collective action 

3. to have their preferences weighted equally in the conduct of the govern- 
ment, that is, weighted with no discrimination because of the content or 
source of the preference (Dahl 1971, 1-2). 


That is the working definition to be used in this research. 


Operationalizing Political Repression—the 1950s 


There have been a few systematic attempts at measuring political repression 
as a policy output of government. Bilson (1982), for instance, examined the 
degree of freedom available in 184 polities, using as a measure of freedom 
the ratings of the repressiveness developed by Freedom House. Dahl provides 
system scores on one of his main dimensions of polyarchy (opportunities for 
political opposition) for 114 countries as they stood in about 1969 (Dahl 1971, 
232). In their various research reports Page and Shapiro (e.g., 1983) measure
civil rights and civil liberties opinions and policies in terms of the adoption of
specific sorts of public policy. Typically, however, the endogenous concept in 
most studies of state policy outputs is some sort of expenditure variable. (See 
Thompson 1981 for a critique of this practice.) These earlier efforts can in- 
form the construction of a measure of political repression in the policy outputs 
of the American states. 

The measure of policy repression that serves as the dependent variable 
in this analysis is an index indicating the degree of political repression di- 
rected against the Communist party and its members during the late 1940s 
and 1950s. A host of actions against Communists was taken by the states, in- 
cluding disqualifying them from public employment (including from teaching 
positions in public schools); denying them access to the ballot as candidates, 
and prohibiting them from serving in public office even if legally elected; re- 
quiring Communists to register with the government; and outright bans on the 
Party. Forced registration was a means toward achieving these ends. 

Of the fifty states, twenty-eight took none of these actions against 
Communists.⁴ Two states—Arkansas and Texas—banned Communists from
the ballot and from public employment, as well as banning the Party itself and 
requiring that Communists register with the government. Another five states 
adopted all three measures against the Communists, but did not require that 
they register with the government. Pennsylvania, Tennessee, and Washing- 
ton did not formally bar Communists from public employment but did out- 
law the party and forbade its members from participating in politics. The re- 
maining twelve states took some, but not all, actions against the Communists. 
From these data, a simple index of political repression has been calculated. 
The index includes taking no action, banning Communists from public em- 
ployment, banning Communists from running candidates and holding pub- 
lic office, and completely banning Communists and the Communist Party. A 
“bonus” score of .5 was given to those states requiring that Communists register
with the government.⁵ Table 1 shows the scores of the individual states
on this measure. 
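
The scoring rule can be sketched in code. This is one reading of the index, inferred from Table 1 rather than stated in the text: a state is scored by the most severe ban it adopted (1 for public employment, 2 for politics, 3 for an outright ban), plus .5 for registration. Pennsylvania's score of 3.0 without an employment ban points to "most severe ban" rather than "number of bans." The registration column below is not printed in Table 1; it is inferred from the half-point scores. Variable names are illustrative.

```matlab
% columns: employment ban, politics ban, outright ban, registration (1=yes, 0=no)
actions = [1 1 1 1      % Arkansas
           1 1 1 0      % Arizona
           0 1 1 0      % Pennsylvania
           1 1 0 1      % Alabama
           1 0 0 0      % California
           0 0 0 1      % Delaware
           0 0 0 0];    % Alaska
severity = [1 2 3];
most_severe = max(actions(:,1:3).*repmat(severity,size(actions,1),1),[],2);
scale_score = most_severe + 0.5*actions(:,4)
```

The seven rows reproduce the scores 3.5, 3.0, 3.0, 2.5, 1.0, .5, and 0 shown for these states in Table 1.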

This measure can rightly be considered to be a valid indicator of political 
repression by the states.⁶ In asserting this I do not gainsay that the state has
the right—indeed, the obligation—to provide for its internal security. Conse- 
quently, statutes that prohibit such actions as insurrection do not necessarily 
constitute political repression. For instance, Texas made it unlawful to “commit,
attempt to commit, or aid in the commission of any act intended to overthrow”
the Texas government (Art. 6889-3A, Sec. 5). This section proscribes
action, not thought or speech, and is therefore not an appropriate measure 
of political repression. However, the next subsection of the statute made it
illegal to “advocate, abet, advise, or teach by any means any person to
commit” a revolutionary act. Indeed, even conspiracy to advocate is prohibited
(Art. 6889-3A, Sec. 5 [3]). This is indeed a constraint on the speech of political
minorities and therefore is treated as repressive. As the action prohibited
moves beyond a specific, criminal behavior, the line between repressive
and nonrepressive legislation becomes less clear. Gellhorn (1952) commented, 


“Traditionally the criminal law has dealt with the malefactor, the one who himself 
committed an offense. Departing from this tradition is the recent tendency to as- 
cribe criminal potentialities to a body of persons (usually, though not invariably, 
the Communists) and to lay restraints upon any individual who can be linked with 
the group. This, of course, greatly widens the concept of subversive activities, 
because it results, in truth, in forgetting about activities altogether. It substitutes 
associations as the objects of the law’s impact. Any attempt to define subversion 
as used in modern statutes must therefore refer to the mere possibility of activity 
as well as to present lawlessness.” (p. 360). 


There can be little doubt as to the effectiveness of this anti-Communist 
legislation. Not only were the Communist Party U.S.A. and other Communist 
parties essentially eradicated, but so too were a wide variety of non- 
Communists. It has been estimated that of the work force of 65 million, 
13 million were affected by loyalty and security programs during the 
McCarthy era (Brown 1958). Brown calculates that over 11 thousand individ- 
uals were fired as a result of government and private loyalty programs. More 
than 100 people were convicted under the federal Smith Act, and 135 people 
were cited for contempt by the House Un-American Activities Committee. 
Nearly one-half of the social science professors teaching in universities at the 
time expressed medium or high apprehension about possible adverse reper- 
cussions to them as a result of their political beliefs and activities (Lazarsfeld 
and Thielens 1958). Case studies of local and state politics vividly portray the 
effects of anti-Communist legislation on progressives of various sorts (e.g., 
Carleton 1985). The “silent generation” that emerged from McCarthyism is 
testimony enough to the widespread effects—direct and indirect—of the po- 
litical repression of the era (see also Goldstein 1978, 369-96). 

Nor was the repression of the era a function of the degree of objective 
threat to the security of the state. Political repression was just as likely to oc- 
cur in states with virtually no Communists as it was to occur in states with 
large numbers of Communists.⁷ The repression of Communists bore no relationship
to the degree of threat posed by local Communists.

It might seem that the repression of Communists, though it is clearly 
repression within the context of the definition proffered above, is not neces- 
sarily “antidemocratic” because the objects of the repression are themselves 


Table 1. Political Repression of Communists by American State Governments

                  Banned from          Banned from    Banned      Scale
State             Public Employment    Politics       Outright    Score

Arkansas          Yes                  Yes            Yes         3.5
Texas             Yes                  Yes            Yes         3.5
Arizona           Yes                  Yes            Yes         3.0
Indiana           Yes                  Yes            Yes         3.0
Massachusetts     Yes                  Yes            Yes         3.0
Nebraska          Yes                  Yes            Yes         3.0
Oklahoma          Yes                  Yes            Yes         3.0
Pennsylvania      No                   Yes            Yes         3.0
Tennessee         No                   Yes            Yes         3.0
Washington        No                   Yes            Yes         3.0
Alabama           Yes                  Yes            No          2.5
Louisiana         Yes                  Yes            No          2.5
Michigan          Yes                  Yes            No          2.5
Wyoming           Yes                  Yes            No          2.5
Florida           Yes                  Yes            No          2.0
Georgia           Yes                  Yes            No          2.0
Illinois          Yes                  Yes            No          2.0
California        Yes                  No             No          1.0
New York          Yes                  No             No          1.0
Delaware          No                   No             No           .5
Mississippi       No                   No             No           .5
New Mexico        No                   No             No           .5
Alaska            No                   No             No           .0
Colorado          No                   No             No           .0
Connecticut       No                   No             No           .0
Hawaii            No                   No             No           .0
Iowa              No                   No             No           .0
Idaho             No                   No             No           .0
Kentucky          No                   No             No           .0
Kansas            No                   No             No           .0
Maryland          No                   No             No           .0
Maine             No                   No             No           .0
Minnesota         No                   No             No           .0
Missouri          No                   No             No           .0
Montana           No                   No             No           .0
North Carolina    No                   No             No           .0
North Dakota      No                   No             No           .0
New Hampshire     No                   No             No           .0
New Jersey        No                   No             No           .0
Nevada            No                   No             No           .0
Ohio              No                   No             No           .0
Oregon            No                   No             No           .0
Rhode Island      No                   No             No           .0
South Carolina    No                   No             No           .0
South Dakota      No                   No             No           .0
Utah              No                   No             No           .0
Vermont           No                   No             No           .0
Virginia          No                   No             No           .0
West Virginia     No                   No             No           .0
Wisconsin         No                   No             No           .0


Note: The scale score is a Guttman score. A “bonus” of .5 was added to the
scale if the state also required that Communists register with the government. See note 4 for
details of the assignments of scores to each state.

“antidemocrats.” To repress Communists is to preserve democracy, it might
be argued. Several retorts to this position can be formulated. First, for 
democracies to preserve democracy through nondemocratic means is illogi- 
cal because democracy refers to a set of means, as well as ends (e.g., Dahl 
1956, 1961, 1971; Key 1961; Schumpeter 1950). The means argument can 
also be judged in terms of the necessity of the means. At least in retrospect 
(but probably otherwise as well), it is difficult to make the argument that the 
degree of threat to the polity from Communists in the 1940s and 1950s in any 
way paralleled the degree of political repression (e.g., Goldstein 1978). Sec- 
ond, the assumption that Communists and other objects of political repres- 
sion are “antidemocratic” must be considered as an empirical question itself 
in need of systematic investigation. As a first consideration, it is necessary to 
specify which Communists are being considered, inasmuch as the diversity 
among those adopting—or being assigned—the label is tremendous. Merely 
to postulate that Communists are antidemocratic is inadequate. Third, the re- 
pression of Communists no doubt has a chilling effect on those who, while 
not Communists, oppose the political status quo. In recognizing the coercive 
power of the state and its willingness to direct that power against those who 
dissent, the effect of repressive public policy extends far beyond the target
group.


Public Opinion Intolerance 
Conceptualization 


“Political tolerance” refers to the willingness of citizens to support the ex- 
tension of rights of citizenship to all members of the polity, that is, to allow 
political freedoms to those who are politically different. Thus, “tolerance im- 
plies a willingness to ‘put up with’ those things that one rejects. Politically, it 
implies a willingness to permit the expression of those ideas or interests that 
one opposes. A tolerant regime, then, like a tolerant individual, is one that 
allows a wide berth to those ideas that challenge its way of life” (Sullivan, 
Piereson, and Marcus 1979, 784). Thus, political tolerance includes support 
for institutional guarantees of the right to oppose the existing regime, includ- 
ing the rights to vote, to participate in political parties, to organize politically 
and to attempt political persuasion. Though there may be some disagreement 
about the operationalization of the concept, its conceptual definition is rela- 
tively noncontroversial (see Gibson and Bingham 1982). 


Operationalization 


The simple linkage hypothesis is that where the mass public is more intolerant,
state public policy is more repressive. Though the hypothesis is simple,
deriving measures of mass intolerance is by no means uncomplicated. Indeed,
the study of state politics continually confronts the difficulty of deriving
measures of state public opinion. Though there are five general alternatives— 
ranging from simulations to individual state surveys—the only viable option 
for estimating state-level opinion intolerance during the McCarthy era is to 
aggregate national surveys by state. 

The source of the opinion data is the Stouffer survey, conducted in 1954. 
This survey is widely regarded as the classic study that initiated inquiry into 
the political tolerance of elites and masses (even though earlier evidence ex- 
ists, e.g., Hyman and Sheatsley 1953). Two independent surveys were ac- 
tually conducted for Stouffer: one by the National Opinion Research Center 
(NORC) and the other by the American Institute for Public Opinion (AIPO- 
Gallup). This design was adopted for the explicit purpose of demonstrating 
the accuracy and reliability of public opinion surveys based on random sam- 
ples. Each agency surveyed a sample of the mass public and of the political 
elites.⁸

Stouffer created a six-point scale to indicate political intolerance (see 
Stouffer 1955, 262-69). The index is a Guttman scale based on the responses 
to fifteen items concerning support for the civil liberties of Communists, so- 
cialists, and atheists (see Appendix for details). The items meet conventional 
standards of scalability and are widely used today as indicators of political 
tolerance (e.g., Davis 1975; Nunn, Crockett, and Williams 1978; McCutcheon 
1985; and the General Social Survey, conducted annually by NORC). 

The process of aggregating these tolerance scores by state is difficult be- 
cause the states of residence of the respondents in the Stouffer surveys were 
never entered in any known version of the data set. Through an indirect pro- 
cess, using the identity of the interviewer and the check-in sheets used to 
record the locations (city and state) of the interviews conducted by each in- 
terviewer, state of residence could be ascertained for the NORC half of the 
Stouffer data set. The respondents were aggregated by state of residence to 
create summary indicators of the level of intolerance in each of the states. The 
Appendix reports the means, standard deviations, and numbers of cases and 
primary sampling units for this tolerance scale for the states represented in the 
NORC portion of the Stouffer survey. Evidence that this aggregation process 
produces reasonably valid state-level estimates of political intolerance is also 
presented. 
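
Aggregating by state amounts to computing a mean, a standard deviation, and a count within each group of respondents. A minimal sketch with made-up data follows; the state codes and tolerance scores are hypothetical, not Stouffer's.

```matlab
state = [1 1 1 2 2 3 3 3 3]';     % hypothetical state code for each respondent
score = [2 3 4 1 5 0 2 2 4]';     % hypothetical tolerance scale scores
state_mean = accumarray(state, score, [], @mean)
state_SD   = accumarray(state, score, [], @std)    % divides by n-1, as std does
state_n    = accumarray(state, 1)
```

With only a handful of respondents in a state, the mean is unstable, so the number of cases should be reported alongside each state mean.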

Aggregating the elite interviews to the state level is in one sense more 
perilous and in another sense less perilous. With a considerably smaller number
of subjects (758 in Stouffer’s NORC sample), the means become more unstable.
On the other hand, the aggregation is not done for the purpose of estimating
some sort of elite population parameter. The elites selected were in
no sense a random sample of state elites, so it makes little sense to try to make
inferences from the sample to some larger elite population. Instead, the elite 
samples represent only themselves. The Appendix reports the state means, 
standard deviations, and numbers of cases. 

There is a moderate relationship between elite and mass opinion in the 
state (r = .52). To the extent that we would expect elite and mass opinion in 
the states to covary, this correlation serves to validate the aggregate measures 
of opinion. The substantive implications of this correlation are considered 
below. 


The Simple Relationship between Opinion and Policy 


Figure 1 reports the relationships between mass and elite political intolerance 
and the adoption of repressive public policies by the states. There is a mod- 
est bivariate relationship during the McCarthy era between mass opinion and 
repressive public policy. In states in which the mass public was more intol- 
erant, there tended to be greater political repression, thus seeming to support 
the elitist theory. However, the relationship is somewhat stronger between 
elite opinion and repression. From a weighted least squares analysis incorporating
both elite opinion and mass opinion, it is clear that it is elite preferences
that most influence public policy. The beta for mass opinion is -.06;
for elite opinion, it is -.35 (significant beyond .01).⁹ Thus political repression
occurred in states with relatively intolerant elites. Beyond the intolerance 
of elites, the preferences of the mass public seemed to matter little. 


Figure 1. Relationships between Opinion and Policy

[Path diagram: Mass Tolerance and Elite Tolerance, correlated .52 (26), each linked to Repression.]

Note: Boldfaced entries are bivariate correlation coefficients, with pairwise
missing data deletion. The nonboldfaced entries are standardized regression
coefficients from a weighted least squares analysis using listwise missing
data deletion. The numbers of cases are shown in parentheses.

Table 2. The Influence of Elite and Mass Opinion on the Repression of
Communists (Percentages)

                           Elite Opinion Less Tolerant      Elite Opinion More Tolerant

                           Mass Opinion     Mass Opinion    Mass Opinion     Mass Opinion
Action                     Less Tolerant    More Tolerant   Less Tolerant    More Tolerant

Adopted repressive
  legislation                   71              100              33               39
Did not adopt
  repressive legislation        29                0              67               62
Total                          100              100             100              101*
Number of cases                  7                3               3               13

* Does not total 100 because of rounding error.


Table 2 reports a cross-tabulation of policy outputs with elite and mass 
opinion. The opinion variables have been dichotomized at their respective 
means. Though the number of cases shown in this table is small—demanding 
caution in interpreting the percentages—the data reveal striking support for 
the conclusion that elite opinion, not mass opinion, determines public policy. 
In eight of the ten states in which elites were relatively less tolerant, repressive 
legislation was adopted. In only six of the sixteen states in which elites were 
relatively more tolerant was repressive legislation passed. Variation in mass 
opinion makes little difference for public policy.¹⁰
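
The counts behind Table 2 can be recovered from the percentages and the numbers of cases, and collapsing the table shows how the two opinion variables compare. In the sketch below, the counts 5, 3, 1, 5 are inferred: they are the only whole numbers consistent with the percentages, and they reproduce the eight of ten and six of sixteen given in the text. The variable names are illustrative.

```matlab
cases   = [7 3 3 13];    % columns of Table 2, left to right
adopted = [5 3 1 5];     % inferred from the percentages
percent_adopted = 100*adopted./cases

% collapse over mass opinion: less tolerant elites vs more tolerant elites
elite_less = 100*sum(adopted(1:2))/sum(cases(1:2))      % 8 of 10
elite_more = 100*sum(adopted(3:4))/sum(cases(3:4))      % 6 of 16

% collapse over elite opinion: less tolerant masses vs more tolerant masses
mass_less = 100*sum(adopted([1 3]))/sum(cases([1 3]))   % 6 of 10
mass_more = 100*sum(adopted([2 4]))/sum(cases([2 4]))   % 8 of 16
```

The difference is 80% versus 37.5% across elite opinion, but only 60% versus 50% across mass opinion. With so few cases in each cell, the percentages should be read with caution.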

It is a little surprising that elite opinion has such a significant impact on 
policy repression. After all, elites tend to be relatively more tolerant than the 
masses. Indeed, this finding is the empirical linchpin of the elitist theory of 
democracy.¹¹ This leads one to wonder just how much intolerance there was
among the elites in the Stouffer data. 

The survey data in fact reveal ample evidence of elite intolerance. For 
instance, fully two-thirds of the elites were willing to strip admitted Com- 
munists of their U.S. citizenship (Stouffer 1955, 43). Indeed, one reading 
of the Stouffer data is that elites and masses differed principally on the de- 
gree of proof of Communist party membership necessary before repression 
was thought legitimate. Much of the mass public was willing to accept a very 
low level of proof of party membership (e.g., innuendo), while many elites 
required a legal determination of Communist affiliation. Once convinced of 
the charge, however, elites were very nearly as intolerant of Communists as 
members of the mass public. Just as McClosky and Brill (1983) have more
recently shown significant intolerance within their elite samples, there is
enough intolerance among these state elites to make them the driving force 
in the repression of Communists. Thus it is plausible that elite intolerance 
was largely responsible for the repressive policies of the era. 

At the same time, there is little evidence that the communism issue was 
of burning concern to the U.S. public. For instance, Stouffer reported that 
“the number of people who said [in response to an open-ended question] that 
they were worried either about the threat of Communists in the United States 
or about civil liberties was, even by the most generous interpretation of oc- 
casionally ambiguous responses, less than 1%” (Stouffer 1955, 59, empha- 
sis in original). Only one-third of the subjects reported having talked about 
communism in the United States in the week prior to the interview, despite 
the fact that the Army-McCarthy hearings were in progress during a portion 
of the survey period. Stouffer asserted, “For most people neither the inter- 
nal Communist threat nor the threat to civil liberties was a matter of universal 
burning concern. Such findings are important. They should be of interest to a 
future historian who might otherwise be tempted, from isolated and dramatic 
events in the news, to portray too vividly the emotional climate of America 
in 1954” (Stouffer 1955, 72). 

The issue of communism in the United States was of much greater con- 
cern to the elites. Nearly two-thirds of them reported having talked about 
communism in the United States during the week prior to the interview. When 
asked how closely they followed news about Communists, fully 44% of the 
mass sample responded “hardly at all,” while only 13% of the elite sample
was as unconcerned (Stouffer 1955, 84). Just as elites typically exhibit greater 
knowledge and concern about public issues, they were far more attentive to 
the issue of domestic Communists. 

Thus it is difficult to imagine that the repression of the 1950s was in- 
spired by demands for repressive public policy from a mobilized mass pub- 
lic. Indeed, the most intense political intolerance was concentrated within that 
segment of the mass public least likely to have an impact on public policy (see
also Gibson 1987). There can be no doubt that the mass public was highly 
intolerant in its attitudes during the 1950s. Absent issue salience, however, it 
is difficult to imagine that the U.S. people had mobilized sufficiently to have 
created the repression of the era.¹²

The actual effect of mass opinion may be masked a bit in these data, 
however. Perhaps it is useful to treat mass intolerance as essentially a con- 
stant across the states during the McCarthy era. Because the mass public was 
generally willing to support political repression of Communists, elites were 
basically free to shape public policy. In states in which the elites were 
relatively tolerant, tolerant policy prevailed. Where elites were relatively less
tolerant, repression resulted. In neither case did mass opinion cause public
policy. Instead, policy was framed by the elites. Nonetheless, the willing- 
ness of the mass public to accept repressive policies was no doubt important. 
Thus, the policy-making process need not be seen as a “demand-input” process
with all its untenable assumptions but rather can be seen as one in which
the preferences of the mass public—perhaps even the political culture of the 
state—set the broad parameters of public policy. In this sense, then, mass po- 
litical intolerance “matters” for public policy. 

We must also note that even if the broader mass public has little influ- 
ence upon public policy, specialized segments of the public may still be im- 
portant. For instance, there is some correlation (r = .31) between the number 
of American Legion members in the state and political repression.¹³ Since the
American Legion had long been in the forefront of the crusade against com- 
munism (see, e.g., American Legion 1937), it is likely that greater numbers 
of members in the state translated into more effective lobbying power. Thus 
particular segments of the mass public can indeed be mobilized for repressive 
purposes. 

I should also reemphasize the strong correlation between elite opinion 
and mass opinion. This correlation may imply that elites are responsive to 
mass opinion or that they mold mass opinion or that elite opinion is shaped 
by the same sort of factors as shape mass opinion. Though it is not possible to 
disentangle the causal process statistically, there is some evidence that both 
elite and mass opinion reflect the more fundamental political culture of the 
state. The correlation between a measure of Elazar’s state-level political culture
and mass intolerance is -.68; for elite opinion the correlation is -.66.
In states with more traditionalistic political cultures both mass and elites tend 
to be more intolerant. Moreover, there is some direct relationship between 
political culture and political repression (r = .31). Perhaps elite and mass 
preferences generally reflect basic cultural values concerning the breadth of 
legitimate political participation and contestation. In the moralistic political 
culture everyone should participate; only professionals should be active in the 
individualistic culture; and only the appropriate elite in traditionalistic polit- 
ical cultures (Elazar 1972, 101-2). Perhaps the political culture of the state 
legitimizes broad propensities toward intolerance, propensities that become 
mobilized during political crises. 

One might also look at the data in Figure 1 from a very different perspec- 
tive. Rather than mass opinion causing public policy, perhaps mass opinion 
is caused by policy (cf. Page, Shapiro, and Dempsey 1987). To turn the elitist 
theory on its head, it is quite possible that the U.S. mass public is in- 
tolerant precisely because they have been persuaded and reinforced by the 
intolerance of U.S. public policy. Through the intolerance of public policy, 
citizens learn that it is acceptable, if not desirable, to repress one’s political
enemies. Though I do not gainsay that there are significant norms in U.S. so- 
ciety supportive of political tolerance (see Sniderman 1975), in practice citi- 
zens have been taught by federal and state legislation that Communists should 
not be tolerated. It is not surprising that many citizens have learned the lesson 
well.¹⁴

This argument is somewhat at variance with those who argue that greater 
exposure to the dominant cultural norms in the United States contributes to 
greater political tolerance. If the norms are tolerant, then greater exposure 
should create tolerance. But greater awareness of repressive norms—as ex- 
pressed in public policies—should be associated with greater intolerance. 
Thus the result of political activism, high self-esteem, and other qualities that 
make us assimilate social norms will vary according to the nature of the norms 
(see Sullivan et al. 1985). 

The norms of U.S. politics are at once tolerant and intolerant. Certainly, 
no one can doubt that support for civil liberties is a widely shared value. The 
key question, however, is “civil liberties for whom?” The U.S. political cul- 
ture has long distinguished between “true Americans” and others and has 
always been willing to deny civil liberties to those who are “un-American.” 
Foreign “isms” have repeatedly become the bogeymen in ideological conflict 
in the United States. Thus, citizens learn that civil liberties are indeed important 
to protect, but only for those who have a “legitimate” right to the liberty. 

Thus the initial evidence is that political repression during the McCarthy 
era was most likely initiated by elites even if the mass public in most states 
would have acquiesced. These findings are not compatible with the elitist 
views that mass intolerance threatens democracy and that elites are the car- 
riers of the democratic creed. 


The Political Culture of Intolerance and Repression 


These findings may very well be limited to the specific historical era of 
McCarthyism. Due to the unavailability of historical data on elite and mass 
opinion it is difficult to judge whether earlier outbreaks of political repres- 
sion can also be attributed to elite intolerance. Building on the discussion of 
political culture above, however, it is possible to give this issue further con- 
sideration. 

Following World War I roughly one-half of the U.S. states adopted criminal
syndicalism statutes.¹⁵ For example, the statute adopted by California
shortly after World War I defined the crime as “any doctrine or precept 
advocating, teaching or aiding and abetting the commission of crime, sabotage 
(which word is hereby defined as meaning willful and malicious physical 
damage or injury to physical property), or unlawful acts of force and violence
or unlawful methods of terrorism as a means of accomplishing a change
in industrial ownership or control, or effecting any political change” (Calif. 
Statutes, 1919. Ch. 188, Sec. 1, p. 281). Though no opinion data exist for the 
1920s, it is possible to examine the relationship between state-level political 
culture and political repression during this earlier era. 

The correlation between state political culture and the adoption of criminal
syndicalism statutes is .40 (N = 50), indicating once again that more
traditionalistic states were more likely to engage in political repression. That 
this correlation is slightly stronger than the coefficient observed for the 1950s 
might speak to the breakdown of homogeneous state cultures as the popula- 
tion became more mobile in the twentieth century. In any event, we see in this 
correlation evidence that the more detailed findings of the McCarthy era may 
not be atypical.¹⁶


Discussion 


What conclusions about the elitist theory of democracy and the theory of 
pluralistic intolerance does this analysis support? First, I have discovered no 
evidence that political repression in the U.S. stems from demands from 
ordinary citizens to curtail the rights and activities of unpopular political mi- 
norities. This finding differs from what is predicted by the elitist theory of 
democracy. Second, I find some evidence of elite complicity in the repression 
of the McCarthy era, a finding that is also incompatible with the eli- 
tist theory. Generally, then, this research casts doubt on the elitist theory of 
democracy. 

Nor are these findings necessarily compatible with the theory of plural- 
istic intolerance advocated by Sullivan, Piereson, and Marcus. Though polit- 
ical intolerance in the 1950s was widespread and highly focused, there seems 
to have been little direct effect of mass opinion on public policy. Like the eli- 
tist theory of democracy, the theory of pluralistic intolerance places too much 
emphasis on mass opinion as a determinant of public policy. 

The “demand-input” linkage process implicitly posited by these theories 
is probably their critical flaw. Early public opinion research that found high 
levels of mass political intolerance too quickly assumed that mass intolerance 
translated directly into public policy. The assumption was easy to make since 
little was known of the processes linking opinions with policy. As linkage re- 
search has accumulated, however, the simple hypothesis relating opinion to 
policy has become increasingly untenable. The justification for studying mass 
political tolerance therefore cannot be found in the hypothesis that survey re- 
sponses direct public policy. 

At the same time, however, public opinion may not be completely irrelevant.
Tolerance opinion strongly reflects the political cultures of the states,
and, at least in the 1950s, political culture was significantly related to levels 
of political repression. Opinion is important in the policy process because it 
delimits the range of acceptable policy alternatives. It may well be that mass 
opinion is manipulated and shaped by elites; nonetheless, those who would 
propose repressive policies in California face a very different set of political 
constraints than those who propose repressive policies in Arkansas. This is 
not to say that repression is impossible—indeed, California has a long history 
of significant levels of political repression—but rather that the task of gaining 
acceptance for repression is different under differing cultural contexts. 

For over three decades now, political scientists have systematically stud- 
ied public policy and public opinion. Significant advances have been made 
in understanding many sorts of state policy outputs, and we have developed a 
wealth of information about political tolerance. To date, however, little atten- 
tion has been given to repression as a policy output, and even less attention 
has been devoted to behavioral and policy implications of tolerance attitudes. 
The failure to investigate the linkage between opinion and policy is all the 
more significant because one of the most widely accepted theories in political 
science—the elitist theory of democracy—was developed on the basis of an 
assumed linkage between opinion and policy. I hope that this research, though 
only a crude beginning, will serve as an early step in continuing research into 
these most important problems of democracy. 


Appendix: Measurement and Aggregation Error in the State-Level 
Estimates of Mass Political Intolerance 


Measurement 


The measure of political tolerance employed here is an index originally con- 
structed by Stouffer. He used fifteen items to construct the scale. Eleven of 
the items dealt with communists; two with atheists (those who are against all 
churches and religion); and two with socialists (those favoring government 
ownership of all railroads and all big industries). Stouffer reported a coeffi- 
cient of reproducibility of .96 for the scale, a very high level of reliability. He 
also reported that reproducibility was approximately the same at all educa- 
tional levels. 

I decided to use Stouffer’s scale even though it includes items on atheists 
and socialists (1) in order to maintain comparability to Stouffer’s re- 
search, (2) because an identical scale was created from a survey in 1973 
that is very useful for assessment of aggregation error, and (3) because the 
scale is so reliable. Stouffer had a strong view of what his scale was measuring. 
He asserted, “But again let it be pointed out, this scale does not
measure ... tolerance in general. It deals only with attitudes toward certain
types of nonconformists or deviants. It does not deal with attitudes toward 
extreme rightwing agitators, toward people who attack minority groups, to- 
ward faddists or cultists, in general, nor, of course, toward a wide variety of 
criminals. For purposes of this study, the tolerance of nonconformity or sus- 
pected nonconformity is solely within the broad context of the Communist 
threat” (Stouffer 1955, 54, emphasis in original). 

The Stouffer measures of tolerance have recently been criticized (e.g., 
Sullivan, Piereson, and Marcus 1982). Perhaps the most fundamental aspect 
of this criticism is the assertion that the Stouffer items measure tolerance only 
for a specific group and thus are not generalizable. Because Stouffer was con- 
cerned only about intolerance of Communists, his findings may be time- 
bound; as the objects of mass displeasure evolve, the Communist-based 
approach to tolerance becomes less relevant and useful. This difficulty does 
not affect my analysis of policy and opinion from the 1950s, however, because 
Communists were probably a major disliked group for nearly all citizens in 
the survey. For instance, only 256 out of 4,933 of the mass respondents were 
willing to assert that someone believing in communism could still be a loyal 
U.S. citizen. Even if Communists were not the least-liked group for all U.S. 
citizens, they were certainly located in the “disliked-enough-not-to-tolerate” 
range for nearly everyone. Thus the Stouffer measure of tolerance is a valid 
and reliable indicator. 


Aggregation Error 


Table A-1 reports the state-level means, standard deviations, and numbers 
of cases for the aggregation of elite and mass opinion. Not all states are in- 
cluded in Table A-1 because survey respondents were not located in every 
state. Since the Stouffer survey was not designed to be aggregated by state, 
it is necessary to try to determine whether there is any obvious bias in the 
state-level estimates. A few empirical tests can be conducted that, while not 
assuaging all doubts about the aggregation process, may make us somewhat 
more comfortable about using the state means. 

The Stouffer survey was replicated in 1973 by Nunn, Crockett, and 
Williams (1978). Their survey was very nearly an exact replication of the 
Stouffer survey. In terms of the indicators of tolerance, it was an exact repli- 
cation. Nunn, Crockett, and Williams were even extremely careful to repro- 
duce Stouffer’s scaling methodology in creating a summary index of 
intolerance (pp. 179-91). Thus it is possible to aggregate the same scale 
variable by state and derive a measure of political tolerance for the early 
1970s. 

Table A-1. State Mean Tolerance Scores, Mass Public, and Elites, NORC
Stouffer Survey, 1954


                              Mass Public                         Elites
                         Standard    Number   Number          Standard    Number
State             Mean  Deviation  of Cases  of PSUs   Mean  Deviation  of Cases
California        4.47       1.50       174        4   5.09       1.43        65
Missouri          4.44       1.20        18        2   5.45        .69        11
New Jersey        4.41       1.43        61        1   4.90       1.28        60
Washington        4.33       1.44        52        2   5.14        .66        14
Iowa              4.26       1.42        23        1      -          -         -
Wisconsin         4.24       1.56        41        2   5.44        .87        25
Massachusetts     4.22       1.47        81        2   4.51       1.21        41
New York          4.21       1.40       273        6   5.06       1.08        81
Oregon            4.20       1.47        15        1      -          -         -
Colorado          4.13       1.46        23        1   5.29       1.33        14
Connecticut       4.12       1.17        17        1   5.17        .83        12
Nebraska          4.06       1.24        16        1   4.40       1.35        10
Minnesota         3.92       1.43        64        3   5.33        .96        27
Ohio              3.83       1.57       103        4   5.02       1.04        54
Illinois          3.81       1.55        86        2   4.97       1.39        39
Nevada            3.77       1.61        31        1      -          -         -
North Dakota      3.76       1.46        41        1   5.17       1.27        12
Pennsylvania      3.75       1.41       179        6   4.77       1.29        43
Michigan          3.75       1.34       163        4   4.92       1.26        38
Kansas            3.64       1.26        59        2      -          -         -
Florida           3.61       1.43        84        2   4.46       1.47        24
New Hampshire     3.58       1.71        19        1   5.36       1.03        11
Maryland          3.45       1.46        51        2      -          -         -
Idaho             3.45       1.65        22        1   5.15       1.07        13
Oklahoma          3.43       1.44        67        3   5.31        .85        13
Virginia          3.40       1.68        15        1      -          -         -
Indiana           3.36       1.32       129        5   4.61       1.40        36
Alabama           3.32       1.27        37        2   4.30       1.46        27
Texas             3.28       1.05       156        5   4.30       1.49        40
Louisiana         3.27       1.34        26        1   4.33       1.67        12
North Carolina    3.17       1.17        65        3   3.60       1.90        10
Tennessee         2.98       1.62        44        2      -          -         -
Georgia           2.86       1.39        50        3      -          -         -
Kentucky          2.86       1.25        22        1   4.77       1.39        26
West Virginia     2.34        .90        29        2      -          -         -
Arkansas          1.79       1.27        19        1      -          -         -
Average           3.65       1.40        65      2.3   4.88       1.22        29


With completely independent samples (including independent sampling 
frames), one would not expect that there would be much of a correlation be- 
tween the Stouffer and the Nunn, Crockett, and Williams state-level estimates. 
Chance fluctuations in the distributions of primary sampling units (PSUs) per 
state would tend to attenuate the correlation between the state-level estimates.
(The average number of PSUs in Stouffer’s NORC survey is 2.3; for the Nunn, 
Crockett, and Williams survey it is 7.8.) Yet the correlation between the es- 
timates from the two surveys is a remarkable .63 (N = 29). If I were to ex- 
clude the 1973 estimate for Connecticut, an estimate that shows that state to 
be quite intolerant, then the correlation increases to .77 (N = 28). It is diffi- 
cult to imagine an explanation for this correlation other than that it is due to 
a common correlation with the true score for the state. 
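
To see how much a single state can move a correlation computed on a few dozen cases, consider the following minimal sketch. The state labels and values are hypothetical and chosen only to make the effect visible; they are not the published estimates, and the helper names are assumptions.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical state means: (first survey, second survey).
# "State E" plays the role of a discrepant estimate.
estimates = {
    "State A": (1.0, 1.1),
    "State B": (2.0, 2.0),
    "State C": (3.0, 2.9),
    "State D": (4.0, 4.2),
    "State E": (5.0, 1.0),
}


def correlation(states):
    """Correlate the two surveys' estimates over the listed states."""
    first = [estimates[s][0] for s in states]
    second = [estimates[s][1] for s in states]
    return pearson(first, second)


all_states = sorted(estimates)
r_all = correlation(all_states)
r_trimmed = correlation([s for s in all_states if s != "State E"])

print(f"all states:        r = {r_all:.2f}")
print(f"without 'State E': r = {r_trimmed:.2f}")
```

Reporting both figures, as the text does, lets the reader judge how far the result depends on one case.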

I have also investigated the relationship between state sample size and 
number of primary sampling units and aggregation error. I first assumed that 
differences between the t₁ and t₂ estimates of state opinion were due to aggregation
error. The residuals resulting from regressing t₂ opinion on t₁ opinion
represent this error; if squared, the residuals represent the total amount of error.
The correlations between the squared residuals and t₁ sample size and
number of PSUs are -.30 and -.27. The correlations between the residuals
and t₂ sample size and number of PSUs are -.29 and -.29. These correlations
indicate that aggregation error is larger in states in which the number of
subjects and number of PSUs are smaller, a not unexpected finding. However,
since the relationships are modest, they do not undermine the basic aggregation
procedure.
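
The residual check described in this paragraph might be sketched as follows: regress the later state estimates on the earlier ones, square the residuals, and correlate them with state sample size. All numbers below are hypothetical placeholders rather than the survey estimates, and the function names are assumptions.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ols_residuals(x, y):
    """Residuals from the least-squares line of y on x (with intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [b - (intercept + slope * a) for a, b in zip(x, y)]


# Hypothetical state-level values (NOT the survey estimates).
opinion_t1 = [4.5, 4.2, 3.9, 3.6, 3.3, 3.0]   # earlier survey, state means
opinion_t2 = [4.8, 4.6, 4.0, 4.1, 3.2, 3.5]   # later survey, state means
sample_size_t1 = [170, 80, 20, 100, 15, 60]   # respondents per state

residuals = ols_residuals(opinion_t1, opinion_t2)
squared = [e ** 2 for e in residuals]

# A negative value would mean larger error in states with fewer respondents.
r_error_size = pearson(squared, sample_size_t1)
print(f"corr(squared residual, sample size) = {r_error_size:.2f}")
```

The same correlation can be computed against the number of PSUs by substituting that column for the sample sizes.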

Another bit of evidence supporting the aggregation process comes from 
the correlations of tolerance and political culture. The correlation between 
Elazar’s measure of political culture and average state tolerance in the 1950s 
is -.68. This correlation enhances my confidence in the utility of the state-level
estimates.

Another, very different tack that can be taken is to estimate the error 
associated with the aggregation process. For each survey, I aggregated the 
proportion of the respondents having twelve or more years of formal edu- 
cation. These percentages can be compared to census estimates of the level 
of education in the state. The comparison is not perfect due to two consid- 
erations. First, the census data are themselves population estimates drawn 
from survey samples. Second, the census reports the percentage of residents 
over the age of twenty-five with twelve or more years of education. I as- 
sume that those with twelve or more years of education have a high school 
degree, although this might not be true for every single respondent. Moreover, 
it is not possible to isolate those respondents twenty-five years and older in 
the Stouffer survey. Nonetheless, the correlation for the 1950s data between 
the survey and census estimates of education is a substantial .72 (N = 36). 
While this correlation does not speak directly to the utility of the state-level 
estimates of tolerance, it does suggest that aggregation from the survey to the 
state is not completely inappropriate. 

The correlation between elite opinion in the 1950s and elite opinion in
the 1970s is .25 (.28 with a minimum-number-of-respondents requirement). 
That the correlation is not higher is a bit worrisome, although it is not difficult 
to imagine that there is greater flux in elite opinion over the two decades sep- 
arating the two surveys than there is in mass opinion. Moreover, there were 
some slight differences in the composition of the elite samples drawn in 1954 
and 1973. 

As a means of assessing the validity of the aggregation of elite opinion, 
it is possible to compare elite tolerance with other elite attitudes. Erikson, 
Wright, and McIver (1987) have developed a separate measure of the degree 
of liberalism of state elites. The measure summarizes the ideological posi- 
tions of the state’s congressional candidates, state legislators, political 
party elites, and national convention delegates. As an overall index of the 
liberalism-conservatism of state elites, they take the average score of the
Democrats and the Republicans. Thus each state receives a score indicating 
the degree of liberalism-conservatism of state elites. Though most of the in- 
dicators are drawn from the 1970s, the authors believe this to be a more stable 
attribute of state elites. According to their index, the most conservative elites 
are found in Mississippi; the most liberal elites are found in Massachusetts. 

The correlation of state elite conservatism and political tolerance is -.46
(N = 26) for the Stouffer elites and -.22 (N = 29) for the Nunn, Crockett,
and Williams elites. Though liberalism-conservatism is conceptually distinct 
from political tolerance, some solace can be taken in this correlation. The 
aggregation process seems not to have introduced unexpected or obviously 
biased estimates of state-level elite opinion. 

tional Science Foundation, SES84-21037. NSF is not responsible for any of
the interpretations or conclusions reported herein. For research assistance, I 
am indebted to David Romero, James P. Wenzel, and Richard J. Zook. This 
is a revised version of a paper delivered at the 1986 annual meeting of the 
American Political Science Association, Washington, D.C., 1986. Several 
colleagues have been kind enough to comment on an earlier version of this 
article, including Paul R. Abramson, David G. Barnum, Lawrence Baum, 
James A. Davis, Thomas R. Dye, Heinz Eulau, George E. Marcus, John P. 
McIver, Paul M. Sniderman, Robert Y. Shapiro, and Martin P. Wattenberg. I 
am also indebted to Patrick Bova, librarian at NORC, for assistance with the
Stouffer data. 

1. The elitist theory of democracy is actually an amalgam of the work 
of a variety of theorists, including Berelson, Lazarsfeld, and McPhee (1954); 
Kornhauser (1959); Lipset (1960); and Key (1961). The most useful anal- 
ysis of the similarities and differences among the theories can be found in 
Bachrach 1967. Some elite theorists emphasize the dominance and control 
of public policy by elites, while other theorists emphasize the antidemocratic 
tendencies of the mass public. The single view most compatible with the hy- 
potheses tested in this article is Kornhauser’s (1959). The hypotheses are also 
to be found in Dye and Zeigler 1987 (see also Dye 1976). Earlier empirical 
work on the tolerance of elites and masses includes Berelson, Lazarsfeld, and 
McPhee 1954; Lipset 1960; Prothro and Grigg 1960; and McClosky 1964. A 
more recent analysis of some of the propositions of elitist theory can be found 
in Gibson and Bingham 1984. 

2. Linkage research is fairly common in other areas of substantive policy 
(e.g., Erikson 1976; Weissberg 1978), but the only rigorous investigation of 
civil liberties is that of Page and Shapiro (1983). They assessed the relation- 
ship between change in opinion and change in policy, and found that in eight 
of nine policy changes in the area of civil liberties there was opinion-policy 
congruence. They also found that state policies were more likely to be con- 
gruent with opinion than national policies, although the relationship did not 
hold in the multivariate analysis. Though their analysis was conducted at the 
national level, their findings seem to suggest that political repression results 
from demands from the mass public. 

3. This is similar to Goldstein’s definition, “Political repression consists 
of government action which grossly discriminates against persons or organi- 
zations viewed as presenting a fundamental challenge to existing power re- 
lationships or key governmental policies, because of their perceived political 
beliefs” (1978, xvi). 

4. The source for these data is a 1965 study requested by a subcommit- 
tee of the Committee on the Judiciary in the U.S. Senate. See also Library 
of Congress, Legislative Reference Service, 1965; Gellhorn 1952; and Pren- 
dergast 1950. Care must be taken in using the Legislative Reference Service 
data because there are a variety of errors in the published report. Corrected 
data, based on an examination of all of the relevant state statutes, are available 
from the author. 

The scores shown in Table 1 reflect actions taken by the state govern- 
ments between 1945 and 1965. The decision to limit the policy measures to 
this period is based on the desire to have some temporal proximity between 
the opinion and policy data. This decision has implications for the scores of
three states. Kansas and Wisconsin both barred Communists from political 
participation in legislation adopted in 1941. This legislation is excluded from 
Table 1. Arkansas is shown as having banned Communists from public em- 
ployment, from politics and outright. Only the outright ban was adopted in 
the 1945-65 period. Because a complete ban necessarily excludes Commu- 
nists from public employment and from political participation, the score for 
Arkansas is shown as 3.5. 

5. These three items scale in the Guttman sense. That is, nearly all of 
the states outlawing the Communist party also denied it access to the ballot 
and public employment. Nearly all of the states that denied Communists ac- 
cess to the ballot as candidates also made them ineligible for public employ- 
ment. The registration variable does not, however, exhibit this pattern of cu- 
mulativeness. Registration seems to have been a means of enforcing a policy 
goal such as banning membership in the Party. Because registration can raise 
Fifth Amendment self-incrimination issues, some states chose not to require 
it. Statutes requiring registration are treated for measurement purposes as rep- 
resenting a greater degree of commitment to political repression, and for that 
reason the “bonus” points were added to the basic repression score. 

6. Validity means not only that measures of similar concepts converge; 
measures of dissimilar concepts must also diverge (Campbell and Fiske 1959). 
Thus it is useful to examine the relationship between the repression measures 
and measures of other sorts of policy outputs. Klingman and Lammers (1984) 
have developed a measure of the “general policy liberalism” of the states. 
General policy liberalism is a predisposition in state public policies toward 
extensive use of the public sector and is thought to be a relatively stable at- 
tribute. I would expect that political repression is not simply another form of 
liberalism, and indeed it is not. The correlation between general policy liberalism
and political repression during the 1950s is only -.18. Moreover, the
relationship between repression and a measure of New Deal social welfare
liberalism policy (see Holbrook-Provow and Poe 1987; Rosenstone 1983) is
only -.22. Repression occurred in states with histories of liberalism just about
as frequently as it did in states typically adopting conservative policies. Thus 
the measure of repression is not simply a form of political liberalism, a finding 
that contributes to the apparent validity of the measure. 

7. This conclusion is based on figures compiled by Harvey Klehr on the 
size of the Communist Party U.S.A. during the 1930s (Klehr 1984, tbl. 19.1 
and personal communication with the author, 21 May 1986). The data are 
from the Party’s own internal record. Klehr believes the data to be reason- 
ably reliable, and others seem to agree (see, e.g., Glazer 1961, 208, n. 3; and 
Shannon 1959, 91). There is also a strong relationship between Party membership
and votes for Communist candidates for public offices in the 1936
elections (as compiled by the American Legion 1937, 44), as well as a strong 
relationship with FBI estimates of Party membership in the states in 1951 
(U.S. Senate, Committee on the Judiciary 1956, 34). 

8. Stouffer defined elites as those who hold certain positions of influence 
and potential influence in local politics. The elite sample was drawn from 
those holding the following positions: community chest chairmen; school 
board presidents; library committee chairmen; Republican county chairmen; 
Democratic county chairmen; American Legion commanders; bar association 
presidents; chamber of commerce presidents; PTA presidents; women’s club 
presidents; DAR regents; newspaper publishers; and labor union leaders. 

9. Weighted least squares was used because I could not assume that the 
variances of the observations were equal. Following Hanushek and Jackson 
(1977, 151-52), I weighted the observations by the square root of the num- 
bers of respondents within the state. The r-square from this analysis is .14. 
The regression equation with unstandardized coefficients is:
Y = 7.31 - .14(mass opinion) - 1.11(elite opinion).
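
The weighting described in note 9 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the author's computation: the state-level numbers are invented, the helper names `solve`, `ols`, and `wls` are hypothetical, and the weights are taken to be the square root of each state's respondent count, applied to every row of the design matrix and to the response before ordinary least squares is run.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def ols(rows, y):
    """Least squares via the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

def wls(rows, y, n_resp):
    """Scale each observation by sqrt(n_i), then run OLS on the scaled data."""
    w = [math.sqrt(n) for n in n_resp]
    rows_w = [[wi * v for v in r] for wi, r in zip(w, rows)]
    y_w = [wi * yi for wi, yi in zip(w, y)]
    return ols(rows_w, y_w)

# Invented illustration data: (intercept, mass opinion, elite opinion) per state.
X = [[1, 2.0, 3.1], [1, 2.5, 3.0], [1, 3.0, 3.6], [1, 3.4, 3.3], [1, 4.1, 4.0]]
y = [3.5, 3.4, 2.9, 3.0, 2.2]
n_resp = [40, 120, 75, 200, 60]   # hypothetical respondents per state

coef = wls(X, y, n_resp)
print("intercept, mass, elite:", [round(c, 3) for c in coef])
```

Scaling each row by the square root of n_i makes the fit minimize the sum of n_i times the squared residual, so states with more respondents count for more.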

10. The data in Table 2 suggest that where the state elites are relatively 
less tolerant, increases in mass tolerance are associated with an increase in po- 
litical repression. Caution must be exercised in interpreting the percentages, 
however, due to the small number of cases available. The data reveal that in 
five of the seven states with a relatively less tolerant mass public, repressive 
legislation was adopted, while in all three of the states with a relatively more 
tolerant mass public repressive legislation was adopted. In the context of the 
numbers of cases, I did not treat this difference as substantively significant. 

11. It might be argued that elite opinion serves only to neutralize intoler- 
ant mass opinion. This suggests an interactive relationship between elite and 
mass opinion. Tests of this hypothesis reveal no such interaction. The impact 
of elite opinion on public policy is not contingent upon the level of tolerance 
of the mass public in the state. 

12. Though it is a bit risky to do so, it is possible to break the policy 
variable into time periods according to the date on which the legislation was 
adopted. A total of sixteen states adopted repressive legislation prior to 1954; 
ten states adopted repressive legislation in 1954 or later. The correlations 
between pre-1954 repression and mass and elite tolerance, respectively, 
are -.05 and -.35. Where elites were more intolerant, policy was more re-
pressive. Mass intolerance seems to have had little impact on policy.

The correlations change rather substantially for the post-1954 policy mea-
sure. There is a reasonably strong correlation between mass intolerance and
repression (r = -.32) but little correlation with elite intolerance (r = -.13).
If one were willing to draw conclusions based on what are surely relatively 
unstable correlations, based on limited numbers of observations, one might 
conclude that early efforts to restrict the political freedom of Communists 
were directed largely by elites, while later efforts were more likely to involve 
the mass public. The initiative for political repression therefore was with the 
elites, though the mass public sustained the repression once it was under way. 

At the same time, however, the slight correlation between pre-1954 pol- 
icy and mass intolerance suggests that mass opinion was not shaped by public 
policy. Where policy was more repressive, opinion was not more intolerant. 
The close temporal proximity here should give us pause in overinterpreting 
this correlation, however. 

13. Note that Stouffer found that the leaders of the American Legion 
were the most intolerant of all leadership groups surveyed (Stouffer 1955, 52). 
Indeed, the commanders interviewed were only slightly less intolerant than 
the mass public. 

14. At the same time, it should be noted that U.S. citizens became sub- 
stantially more tolerant of Communists by the 1970s (e.g., Davis 1975; Nunn, 
Crockett, and Williams 1978). This too might reflect changes in public policy, 
as well as elite leadership of opinion. As the U.S. Supreme Court invalidated 
some of the most repressive state and federal legislation of the McCarthy era, 
and as U.S. political leaders (including Richard Nixon) sought improved for- 
eign relations with Communist nations, it became less appropriate to support 
the repression of Communists. These comments illustrate, however, the diffi- 
culty of sorting out the interrelationships of opinion and policy and also reveal 
that many efforts to do so border on nonfalsifiability. 

15. Between 1917 and 1920, twenty-four states adopted criminal syndi- 
calism statutes. There is some ambiguity in published compilations about the 
number of states with such laws. Dowell (1969) lists twenty states with such 
legislation, not counting the three states that adopted but then repealed syn- 
dicalism laws. Dowell apparently overlooked Rhode Island, at least accord- 
ing to the compilations of Chafee (1967) and Gellhorn (1952). On the other 
hand, neither Chafee nor Gellhorn listed Colorado or Indiana as having such 
statutes (though Chafee did list the states that had repealed their legislation). 
This latter problem is in part a function of determining whether specific 
statutes should be classified as banning criminal syndicalism. By 1937, three 
states had repealed their statutes (although one of these—Arizona—apparently 
did so inadvertently during recodification). As of 1981, seven of these states 
still had the statutes on their books, and one additional state—Mississippi— 
had passed such legislation (Jenson 1982, 167-75). For purposes of this
analysis, Dowell’s twenty-three states and Rhode Island are classified as hav-
ing criminal syndicalism laws as of 1920.

16. It should also be noted that political culture is fairly stably related 
to mass political intolerance. Estimates of state opinion were derived from 
Roper data on an item about loyalty oaths asked in a 1937 survey. Opinion in 
more traditionalistic states was more supportive of mandatory loyalty oaths 
(r = -.44, N = 47). Similarly, the correlation between political culture and
the state aggregates from the Stouffer replication in 1973 (see the Appendix)
is -.58 (N = 35). These coefficients are nothing more than suggestive, but
they do suggest that political intolerance is a relatively enduring attribute of 
state political culture.

I. Introduction


More than ten years ago, James Coleman and his colleagues launched a na- 
tional debate over the relative quality of public and Catholic schools 
[Coleman and Hoffer 1987; Coleman, Hoffer, and Kilgore 1982]. Based on 
their analysis of the High School and Beyond (HS&B) data, they concluded 
that Catholic school students scored significantly higher than public school 
students on standardized tests, even after controlling for differences in fam- 
ily characteristics. Catholic schools in their study appeared to be particularly 
effective with minority students. 

Almost immediately, the Coleman results generated tremendous inter- 
est among both policy analysts and academics. Academic journals devoted 
special issues to their research on at least six different occasions (Harvard 
Education Review in 1991; Phi Beta Kappa in 1981; Education Researcher 
in 1981; and Sociology of Education in 1982, 1983, and 1985). Critics raised 
a number of issues about their work. Several papers showed that the es- 
timated magnitude of the Catholic school effect was very sensitive to the 
choice of other independent variables [Lee and Bryk 1988; Noell 1982]. A
number of papers questioned whether the results were driven by a selection 
bias. Since parents decide whether to send their children to public or Catholic 
schools, it is inappropriate to estimate the effect of Catholic schools on test 
scores with a single-equation model that treats school choice as an exogenous 
variable [Goldberger and Cain 1982]. Others argued that the increase in test 
scores between sophomore and senior years was so small that the Coleman 
results had little relevance in the debate over school choice [Murnane 1984; 
Alexander and Pallas 1985; Witte 1992].¹ Based on his review of the
Coleman work and subsequent studies, Cookson [1993, p. 181] concluded 
that “...once the background characteristics of students are taken into ac- 
count, student achievement is not directly related to private school atten- 
dance. The effects that were reported by Coleman and his associates are 
too small to be of any substantive significance in terms of incrementally 
improving student learning.” 

Most of Coleman’s work and virtually all of the research that followed 
focused on the effects of Catholic schools on test scores.² In some ways it
is surprising that test scores have received so much attention while other 
important education outcomes have not. Test scores have obvious limitations. 
It has often been argued that standardized tests in general may be culturally, 
racially, and sexually biased. Teachers may “teach to the test” and thus inflate 
scores [Henig 1994]. On the other hand, students often gain little by doing 
well on an exam and thus may not take the exam seriously. Standardized 
tests can only measure a student’s ability to deal with a particular type of 
question and cannot measure a student’s creativity or deeper problem-solving
skills. The particular test included in the original Coleman work was a short
and relatively simple exam, and the results may not be indicative of school 
performance. Perhaps most importantly, there is little evidence that raising 
test scores has important economic consequences. The impact of test scores 
on wages, for example, appears to be modest.³

This suggests that we consider alternative criteria to evaluate schools 
that have important economic consequences. Card and Krueger [1994] ar- 
gue that measures of educational attainment such as completing high school 
and going on to college are particularly useful measures of schools’ suc- 
cess. Unlike test scores, there is a great deal of evidence on the benefits of 
additional education. Only 65 percent of young male high school dropouts 
were employed in 1986 as compared with 85 percent of high school grad- 
uates [Markety 1988]. Between 1980 and 1985 the unemployment rate for 
males without a high school diploma was 35 percent higher than the rate for 
high school graduates and five times as large as the rate for college graduates 
[Murphy and Topel 1987]. The unemployment rate for young black males 
without high school degrees was over 40 percent for most of the 1980s. 
Wages and earnings are substantially lower for those high school dropouts 
who do find work. In 1987 the median yearly income for 25-to-34 year-old 
male full-time workers with a high school degree was 21.2 percent larger 
than the value for those who had not finished high school [Levy and Mur- 
nane 1992]. Hashimoto and Raisian [1985] and Weiss [1988] found that an 
extra year of education that leads to a high school degree has a much larger 
impact on wages than does an additional year of school that does not lead 
to a degree. Real wages for young male high school dropouts declined by 
23 percent between 1979 and 1988, while young male college graduates ex- 
perienced a 7 percent real wage increase over the same period [Bound and 
Johnson 1992]. High school dropouts are far more likely to commit crimes 
[Thornberry, Moore, and Christenson 1985] and to use illegal drugs [Mensch 
and Kandel 1988]. 

Thus, the debate over Catholic schools seems to have missed outcomes 
with important economic implications. In this paper we have gone back to 
the HS&B data and looked at the impact of a Catholic school education on the 
probability of, first, finishing high school, and, second, starting college. We 
have paid particular attention to the issue of selection bias. If students with 
more ability or students from families that place a higher value on educa- 
tion are more likely to attend Catholic schools, then single-equation models 
would overstate the effects of a Catholic school education. Therefore, the 
appropriate model must take this endogeneity into account. Because both of 
our outcome measures and the treatment variable (a Catholic school dummy) 
are dichotomous, we estimate a set of bivariate probit models.

Our major conclusions are as follows. We find a great deal of support
for the argument that Catholic schools are more effective than public schools. 
Single-equation estimates suggest that for the typical student, attending a 
Catholic high school raises the probability of finishing high school or enter- 
ing a four-year college by thirteen percentage points. Unlike single-equation 
estimates of the effect of Catholic schools on test scores, these results are 
qualitatively important and are robust. This Catholic school effect is very 
large. It is twice as large as the effect of moving from a one- to a two-parent 
family and two and one-half times as large as the effect of raising parents’ 
education from a high school dropout to a college graduate. In models where 
we treat the decision to attend a Catholic school as an endogenous variable, 
we find almost no evidence of selection bias. Bivariate probit estimates of 
the average treatment effect of Catholic schools on high school graduation 
and entering college are very similar to single-equation probit estimates. 
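
To see what selection bias of the kind discussed above would look like if it were present, consider a small simulation. The sketch is purely illustrative and uses invented parameters, not the HS&B data: an unobserved trait `u` raises both the chance of attending a Catholic school and the chance of graduating, the assumed true effect on the latent index is `DELTA`, and the raw difference in graduation rates is compared with the true average effect.

```python
import random
from statistics import NormalDist

PHI = NormalDist().cdf
DELTA = 0.3   # assumed true effect of the Catholic school dummy on the latent index
GAMMA = 0.8   # assumed loading of the unobserved trait in both equations

def simulate(n=20000, seed=0):
    rng = random.Random(seed)
    grads = {0: [], 1: []}
    true_effects = []
    for _ in range(n):
        u = rng.gauss(0, 1)                                    # unobserved trait
        c = 1 if -1.0 + GAMMA * u + rng.gauss(0, 1) > 0 else 0  # school choice
        base = 0.5 + GAMMA * u                                 # index without the school effect
        y = 1 if base + DELTA * c + rng.gauss(0, 1) > 0 else 0  # graduation
        grads[c].append(y)
        # Effect of switching this student from public to Catholic, given u
        true_effects.append(PHI(base + DELTA) - PHI(base))
    naive = sum(grads[1]) / len(grads[1]) - sum(grads[0]) / len(grads[0])
    true_ate = sum(true_effects) / n
    return naive, true_ate

naive, true_ate = simulate()
print(f"naive difference in graduation rates: {naive:.3f}")
print(f"true average treatment effect:        {true_ate:.3f}")
```

In this invented setup the naive comparison is much larger than the true effect, because students with high `u` are over-represented in Catholic schools. The authors' finding is that their data show little sign of this pattern.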

Our bivariate probit model is properly identified if there is at least one 
variable that is correlated with whether or not a student attends a Catholic 
school but is uncorrelated with a student’s unobserved propensity to gradu- 
ate from high school or start college. In most of our work we have used as our 
instrument a dummy variable that equals 1 if the student is from a Catholic 
family and 0 otherwise. The credibility of our bivariate probit results obvi-
ously hinges on our assumption that high school students who are Catholic
are no more likely to graduate from high school or to begin college than stu- 
dents who are not Catholic. As we argue below, once we control for other 
observed factors, it appears that being Catholic is not an important determi- 
nant of most economic outcomes. We also present tests of overidentifying 
restrictions that indicate that our instruments are valid and additional results 
where we use the religious composition of the population in the county where 
a student attends school as an alternative instrument. 

In the next section we describe the HS&B data set and the basic vari- 
ables we have used in our analysis. In Section III we present single-equation 
probit estimates of high school completion and college entrance models. 
In that section we also present a number of sensitivity tests of our single- 
equation model. In Section IV we present bivariate probit models that treat 
the decision to attend a Catholic school as an endogenous variable. We present 
a brief summary and conclusions in the final section of the paper. 


II. Data


Most of the data for our study were drawn from the HS&B survey, which 
began in the spring of 1980. The original sample was chosen in two stages. 
Over 1100 secondary schools were selected in the first stage. In the second,
up to 36 sophomores and 36 seniors were selected from each of the sample
schools. Certain types of schools, including public schools with high per-
centages of Hispanic students and Catholic schools with high percentages of
minority students, were oversampled. The original HS&B sample included 
more than 30,000 sophomores and 28,000 seniors. Follow-up surveys of a 
stratified random sample of the original sophomore cohort were conducted 
in 1982, 1984, and 1986. Our sample is drawn from the 13,683 students who 
were sophomores in 1980 and who were included in both the 1982 and 1984 
follow-ups. We eliminated 389 students who attended private non-Catholic 
schools or whose education level in 1984 is unknown. Thus, our final sample 
includes 13,294 observations. 

HS&B contains information on a wide range of topics including indi- 
vidual and family background, high school experiences, and plans for the 
future. Each student was also given a series of cognitive tests that measured 
verbal and quantitative ability. The sophomore cohort completed these tests 
in the initial 1980 survey and again in the first follow-up in 1982 (when most 
were seniors).⁴ School questionnaires, which were completed by an official
in each participating school, provided information about dropout rates, staff, 
educational programs, facilities, and services. 

Table I presents definitions and summary statistics for some of the im-
portant variables we have used in our study.⁵ We classify students as public
or Catholic school students based on the school they attended as sophomores. 
Our study focuses on two measures of educational attainment: high school 
completion and the decision to begin college. We constructed both variables 
from the 1984 follow-up data when many of the 1980 HS&B sophomores 
would have been out of high school for two years. HIGH SCHOOL GRAD- 
UATE is a dummy variable that equals 1 if the student had completed high 
school by 1984. COLLEGE ENTRANT is a dummy variable that equals 1 if 
the student had enrolled in a four-year college by February of 1984 (and did 
not first enroll in a two-year college or a vocational training program). Since 
graduating high school is a precondition for starting college, all of our work 
defines the COLLEGE ENTRANT variable for only those students who have 
a high school degree.⁶

Most of the family characteristics require little explanation. As can be 
seen in Table I, data on family income and parents’ education are missing in 
a significant number of cases. We suspect that these values are missing in a 
nonrandom sample of the population. For example, graduation rates among 
students where the parents’ education is missing are ten percentage points 
lower than the rate for students where the education variable is available.⁷
We looked at a number of strategies to deal with this missing data prob- 
lem including the estimation of a model suggested by Griliches, Hall, and 
Hausman [1978] in which we treat nonreporting as an endogenous variable.

Table I. Summary Statistics: High School and Beyond Data Set


                                                                Catholic    Public
                                                                school      school
                                                                mean and    mean and
Variable name           Definition                              (std. dev.) (std. dev.)
High School             0-1 dummy variable, = 1 if student      0.97        0.79
Graduate                graduated from high school by           (0.17)      (0.41)
                        February of 1984
College                 0-1 dummy variable, = 1 if first        0.55ᵃ       0.32ᵃ
Entrant                 postsecondary school attended           (0.50)      (0.47)
                        was 4-year college
Catholic                0-1 dummy variable, = 1 if the          0.79        0.29
Religion                student is Catholic                     (0.41)      (0.45)
% Catholic in           Percent of the population in the        31.65       22.37
County                  county where the student attends        (13.17)     (16.82)
                        school that is Catholic
Female                  0-1 dummy variable, = 1 if student      0.56        0.50
                        is female                               (0.50)      (0.50)
Black                   0-1 dummy variable, = 1 if student      0.15        0.13
                        is black                                (0.36)      (0.34)
Hispanic                0-1 dummy variable, = 1 if student      0.22        0.22
                        is Hispanic                             (0.41)      (0.41)
White                   0-1 dummy variable, = 1 if student      0.61        0.58
                        is white, non-Hispanic                  (0.49)      (0.49)
Other Race              0-1 dummy variable, = 1 if student      0.02        0.06
                        is other race                           (0.15)      (0.24)
Family Income           0-1 dummy variable, = 1 if family       0.22        0.23
Missing                 income is not reported                  (0.41)      (0.42)
Family Income           0-1 dummy variable, = 1 if family       0.03        0.07
< $7000                 income < $7000                          (0.16)      (0.26)
Family Income           0-1 dummy variable, = 1 if family       0.07        0.11
$7000-$12,000           income > $7000 and < $12,000            (0.26)      (0.31)
Family Income           0-1 dummy variable, = 1 if family       0.12        0.15
$12,000-$16,000         income > $12,000 and < $16,000          (0.32)      (0.35)
Family Income           0-1 dummy variable, = 1 if family       0.14        0.15
$16,000-$20,000         income > $16,000 and < $20,000          (0.35)      (0.35)
Family Income           0-1 dummy variable, = 1 if family       0.16        0.13
$20,000-$25,000         income > $20,000 and < $25,000          (0.36)      (0.33)
Family Income           0-1 dummy variable, = 1 if family       0.13        0.09
$25,000-$38,000         income > $25,000 and < $38,000          (0.33)      (0.29)
Family Income           0-1 dummy variable, = 1 if family       0.14        0.07
> $38,000               income > $38,000                        (0.35)      (0.25)
Parent Education        0-1 dummy variable, = 1 if              0.09        0.19
Missing                 parents’ education not reported         (0.29)      (0.40)
Parent High             0-1 dummy variable, = 1 if parents’     0.23        0.30
School Dropout          highest education < high school         (0.42)      (0.46)
                        graduate
Parent High             0-1 dummy variable, = 1 if parents’     0.19        0.20
School Graduate         highest education is high school        (0.39)      (0.40)
                        graduate
Parent Some             0-1 dummy variable, = 1 if parents’     0.28        0.19
College                 highest education is some college       (0.45)      (0.39)
Parent College          0-1 dummy variable, = 1 if parents’     0.21        0.11
Graduate                highest education is college            (0.41)      (0.31)
                        graduate
Single Mother           0-1 dummy variable, = 1 if student’s    0.12        0.15
                        household is headed by single           (0.32)      (0.35)
                        mother
Single Father           0-1 dummy variable, = 1 if student’s    0.03        0.05
                        household is headed by single           (0.17)      (0.21)
                        father
Natural Mother/         0-1 dummy variable, = 1 if student      0.04        0.06
Stepfather              lives with natural mother and           (0.19)      (0.24)
                        stepfather
Both Natural            0-1 dummy variable, = 1 if student      0.76        0.62
Parents                 lives with both natural parents         (0.43)      (0.48)
Other Family            0-1 dummy variable, = 1 if student’s    0.06        0.12
Structure               household has other structure           (0.24)      (0.32)
Age 16                  0-1 dummy variable, = 1 if student is   0.03        0.03
                        < 16 years of age in February of        (0.17)      (0.17)
                        1982
Age 17                  0-1 dummy variable, = 1 if student is   0.63        0.49
                        17 years of age in February of 1982     (0.48)      (0.50)
Age 18                  0-1 dummy variable, = 1 if student is   0.32        0.40
                        18 years of age in February of 1982     (0.47)      (0.49)
Age 19+                 0-1 dummy variable, = 1 if student is   0.02        0.08
                        19 years of age or older                (0.15)      (0.26)
Attends Religious       0-1 dummy variable, = 1 if student      0.69        0.44
Services Regularly      attends church at least twice a         (0.46)      (0.50)
                        month
Attends Religious       0-1 dummy variable, = 1 if student      0.17        0.23
Services Occasionally   attends church occasionally             (0.38)      (0.42)
Never Attends           0-1 dummy variable, = 1 if student      0.13        0.33
Religious Services      never attends church                    (0.34)      (0.47)
10th Grade Test Score   Student’s sophomore score on            30.06       24.53
                        standardized exam                       (14.63)     (15.87)
Test Score Missing      0-1 dummy variable, = 1 if sophomore    0.08        0.16
                        test score is missing                   (0.28)      (0.37)
No. of obs.                                                     2,527       10,767


a. The COLLEGE ENTRANT means are conditional on having completed high school. 


In the end we fell back on a straightforward approach of defining income and 
parents’ education in terms of a set of dummy variables and including “miss-
ing data” as a category. We chose the highest income and highest education
groups as reference categories in order to facilitate the interpretation of the
results.

Table I shows that, compared with Catholic school students, public
school students were more than seven times as likely to drop out of high 
school and were just over half as likely to start college. That table also in- 
dicates that the characteristics of Catholic school students suggest that they 
were more likely to succeed in school. Public school students scored lower 
on standardized tests and were far more likely to be eighteen years of age 
or older, to come from low-income families, to have parents who had not 
finished high school, and to live without their father. The basic question in 
this paper is whether Catholic schools still have an important impact on high 
school graduation and college entrance once we control for the effects of 
these measured differences across students as well as any unmeasured dif- 
ferences. Our sample includes significant numbers of Catholic students who 
attend public schools and non-Catholic students who attend Catholic schools, 
thus leaving open the possibility that we can separate the effects of religion 
from the effects of a religious education. 
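
The two ratios quoted at the start of this paragraph can be checked against the Table I means. The figures below use the rounded values printed in the table, so the results are approximate:

```latex
\[
\frac{1 - 0.79}{1 - 0.97} = \frac{0.21}{0.03} = 7.0,
\qquad
\frac{0.32}{0.55} \approx 0.58 .
\]
```

With rounded inputs the dropout ratio comes out at about seven, and the college entrance ratio is a little over one half. The "more than seven times" in the text presumably reflects unrounded means.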

One simple yet informative test is to compare education outcomes across 
broad demographic and ability groups. These results parallel the discussion 
in Coleman and Hoffer [1987, Chapter 4]. In Table II graduation and college 
entrance rates are computed by ability, family income, parents’ education, 
sex, and race. The table shows that the probability that a public school stu- 
dent will graduate varied dramatically across groups. Among Catholic school 
students, however, these differences were small. For example, the gradua- 
tion rate for public school students whose parents were high school dropouts 
was fourteen percentage points lower than the rate for public school students 
whose parents were college graduates. Among Catholic school students this 
difference was only four percentage points. As a consequence, the difference 
in graduation rates between Catholic and public school students is smallest 
among students with high test scores from high income, well-educated fam- 
ilies. However, even for those groups, Catholic school students graduated at 
higher rates than their public school counterparts. 
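
The comparisons in this paragraph can be reproduced directly from the Table II entries. The short sketch below copies a few rows of the graduation columns as printed (rounded to two decimals). The dictionary layout and variable names are only illustrative.

```python
# (public, Catholic) high school graduation rates, copied from Table II
grad = {
    "Parent H.S. Dropout":     (0.77, 0.95),
    "Parent College Graduate": (0.91, 0.99),
    "Test First Quartile":     (0.63, 0.91),
    "Test Fourth Quartile":    (0.95, 0.99),
    "Family Income < $7000":   (0.64, 0.91),
    "Family Income > $38000":  (0.86, 0.98),
}

# Within-sector spread between the top and bottom parent-education groups
public_spread = grad["Parent College Graduate"][0] - grad["Parent H.S. Dropout"][0]
catholic_spread = grad["Parent College Graduate"][1] - grad["Parent H.S. Dropout"][1]
print(f"public spread:   {public_spread:.2f}")
print(f"Catholic spread: {catholic_spread:.2f}")

# Catholic-minus-public gap for each group
sector_gap = {name: round(cath - pub, 2) for name, (pub, cath) in grad.items()}
for name, gap in sector_gap.items():
    print(f"{name:26s} {gap:.2f}")
```

The sector gap is smallest in the top test-score, education, and income groups, but it is positive in every row shown.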

As one would expect, there is far more heterogeneity across de-
mographic groups in college entrance rates. Across all groups, however,
Catholic school students were more likely to begin college. As with the
high school graduation rates, the differences across sectors decline as abil-
ity, family income, and parents’ education increase, but there are still large
differences in college matriculation rates even for the top categories in all 
groups.” 


III. Probit Models of Educational Attainment


The literature on the effect of Catholic schools on the probability of graduat-
ing from high school and going to college has rarely gone beyond the sort of

Table II. Educational Outcomes of High School Students by School Type

                                  HIGH SCHOOL GRADUATE    COLLEGE ENTRANTᵃ
                                  Public      Catholic    Public      Catholic
Sample                            schools     schools     schools     schools
Full sample                       0.79        0.97        0.32        0.55
Sophomore Test Score Missing      0.71        0.98        0.22        0.50
Sophomore Test First Quartile     0.63        0.91        0.11        0.25
Sophomore Test Second Quartile    0.80        0.96        0.19        0.40
Sophomore Test Third Quartile     0.89        0.98        0.37        0.56
Sophomore Test Fourth Quartile    0.95        0.99        0.62        0.78
Parent Education Missing          0.65        0.92        0.16        0.40
Parent H.S. Dropout               0.77        0.95        0.22        0.41
Parent H.S. Degree                0.82        0.97        0.30        0.54
Parent Some College               0.87        0.98        0.44        0.62
Parent College Graduate           0.91        0.99        0.61        0.67
Family Income Missing             0.74        0.97        0.25        0.48
Family Income < $7000             0.64        0.91        0.19        0.36
Family Income $7000-$12000        0.76        0.92        0.23        0.44
Family Income $12000-$16000       0.81        0.98        0.29        0.51
Family Income $16000-$20000       0.84        0.97        0.33        0.49
Family Income $20000-$25000       0.84        0.96        0.38        0.57
Family Income $25000-$38000       0.87        0.99        0.47        0.70
Family Income > $38000            0.86        0.98        0.52        0.66
Female                            0.80        0.97        0.33        0.53
Male                              0.78        0.97        0.31        0.58
Black                             0.76        0.95        0.33        0.62
Hispanic                          0.76        0.93        0.21        0.45
White                             0.81        0.99        0.35        0.56
Other Race                        0.84        0.98        0.38        0.56

a. The COLLEGE ENTRANT means are conditional on having completed high school.


simple cross tabulations in Table II. In this section we extend this literature 
by examining the student’s decision to complete high school or enter college 
by estimating a set of probit models. 


A. Single-Equation Probit Models 


In the high school graduation version of this model, let the indicator variable
$Y_i = 1$ if student $i$ completes high school, and let $Y_i = 0$ otherwise. The
choice problem is described by the latent variable model

$$Y_i^* = X_i\beta + C_i\delta + \varepsilon_i, \qquad (1)$$

where $Y_i^*$ is the net benefit a student receives from graduating high school,
$X_i$ is a vector of individual characteristics, $C_i$ is a Catholic school dummy
variable, and $\varepsilon_i$ is a normally distributed random error with zero mean and
unit variance. Students will only graduate from high school if the expected
net benefits of completion are positive, and thus the probability that a student
finishes high school is

$$\mathrm{prob}[Y_i = 1] = \mathrm{prob}[X_i\beta + C_i\delta + \varepsilon_i > 0] = \Phi[X_i\beta + C_i\delta], \qquad (2)$$

where $\Phi[\,\cdot\,]$ is the evaluation of the standard normal cdf.
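
Equation (2) can be evaluated with nothing more than the standard normal cdf. The sketch below is a minimal illustration: the function name `prob_graduate` and the index value 1.0 are hypothetical, while 0.777 is the Catholic School coefficient printed in the high school graduation column of the table of probit estimates. It also checks by simulation that the latent-variable rule (graduate if Y* > 0) gives the same probability.

```python
import random
from statistics import NormalDist

PHI = NormalDist().cdf   # standard normal cdf

def prob_graduate(xb, catholic, delta):
    """Equation (2): P(Y = 1) = Phi(X*beta + C*delta)."""
    return PHI(xb + catholic * delta)

xb = 1.0        # hypothetical value of X*beta for one student
delta = 0.777   # Catholic School coefficient as printed

p_public = prob_graduate(xb, 0, delta)
p_catholic = prob_graduate(xb, 1, delta)

# Latent-variable check: draw eps ~ N(0, 1) and count how often Y* > 0.
rng = random.Random(1)
draws = 100_000
share = sum(xb + delta + rng.gauss(0, 1) > 0 for _ in range(draws)) / draws

print(f"P(grad | public)   = {p_public:.3f}")
print(f"P(grad | Catholic) = {p_catholic:.3f}")
print(f"simulated share    = {share:.3f}")
```

Because the error has unit variance, the coefficients are identified only up to this normalization, which is why the index enters the cdf directly.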

In all of our high school graduation and college entrance probit mod-
els, we use the set of individual and family characteristics listed in Table I,
dummy variables for urban and rural schools, and three indicators for cen-
sus regions. Maximum likelihood estimates of the high school completion
and college entrance models are reported in columns 1 and 3 of Table III.
To measure the qualitative importance of all our right-hand-side variables,
we report the marginal effect $\partial\,\mathrm{prob}(Y_i = 1)/\partial X_i$ for a reference individual
in columns 2 and 4.¹⁰ For the CATHOLIC SCHOOL dummy variable, we
also report at the bottom of Table III the “average treatment effect,” which
is the average difference between the probability that a student would grad-
uate from high school if he or she attended a Catholic high school and the
probability that student would graduate if he or she attended a public school.
Thus, if $n$ is the sample size and $\hat\beta$ and $\hat\delta$ are the maximum likelihood es-
timates of the parameters in equation (2), then the average treatment ef-
fect equals $(1/n)\sum_{i=1}^{n}[\Phi(X_i\hat\beta + \hat\delta) - \Phi(X_i\hat\beta)]$. We use the “delta” method
to calculate the variance of the marginal effects and average treatment
effects.
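
The two summaries defined above can be sketched as follows, under stated assumptions. The marginal effect is taken to be the derivative of the normal cdf at a reference individual's index, multiplied by the coefficient. The average treatment effect is the sample mean of the difference in fitted probabilities with and without the Catholic school dummy. The index values are simulated, not the HS&B sample, and the function names are hypothetical.

```python
import random
from statistics import NormalDist

STD = NormalDist()

def marginal_effect(xb_ref, coef):
    """Derivative of Phi(x*beta) with respect to one regressor at a reference index."""
    return STD.pdf(xb_ref) * coef

def average_treatment_effect(xb_values, delta):
    """(1/n) * sum over i of [Phi(X_i*beta + delta) - Phi(X_i*beta)]."""
    total = sum(STD.cdf(xb + delta) - STD.cdf(xb) for xb in xb_values)
    return total / len(xb_values)

delta = 0.777                      # Catholic School coefficient as printed
rng = random.Random(2)
xb_sample = [rng.gauss(0.8, 0.6) for _ in range(5000)]   # simulated X_i*beta values

ate = average_treatment_effect(xb_sample, delta)
me = marginal_effect(1.4, delta)   # 1.4 is a hypothetical reference index
print(f"average treatment effect:  {ate:.3f}")
print(f"marginal effect at xb=1.4: {me:.3f}")
```

With the printed coefficient, a reference index near 1.4 gives a marginal effect of roughly 0.116, close to the 0.117 reported for Catholic School. The index value is backed out here for illustration and is not stated in the paper.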

The results in Table III show that Catholic school students have a sub- 
stantially higher probability of completing high school and entering a four- 
year college than do public school students. Our reference individual’s prob- 
ability of finishing high school would be twelve percentage points higher 
if she went to a Catholic school than if she went to a public school. The 
probability that she would enter college would be fourteen percentage points 
higher. To place these results in perspective, the impact of Catholic schools 
on high school completion is more than two and one-half times larger than 
the effect of moving from the lowest to the highest income group, 50 per- 
cent larger than the effect of moving from the lowest to the highest parents’ 
education category, and three times as large as the impact of moving from 
a family headed by a single female to a two-parent family. The estimated 
marginal effects for CATHOLIC SCHOOL reported in Table III are roughly 
equal to the average treatment effects for the entire sample.¹¹

Table III. Probit Estimates of HIGH SCHOOL GRADUATE and COLLEGE ENTRANT Models
                              HIGH SCHOOL GRADUATE      COLLEGE ENTRANT
                              Probit       Marginal     Probit       Marginal
Independent variableᵃ         coefficient  effectᵇ      coefficient  effectᵇ
Catholic School               0.777        0.117        0.384        0.144
                              (0.056)      (0.014)      (0.032)      (0.012)
Female                        0.041        0.006        0.021        0.008
                              (0.029)      (0.004)      (0.026)      (0.010)
Black                         0.132        0.020        0.170        0.064
                              (0.045)      (0.007)      (0.042)      (0.014)
Hispanic                      0.080        0.012        -0.160       -0.060
                              (0.037)      (0.006)      (0.036)      (0.014)
Other Race                    0.346        0.052        0.316        0.118
                              (0.067)      (0.011)      (0.060)      (0.022)
Family Income Missing         0.111        0.017        -0.382       -0.143
                              (0.068)      (0.010)      (0.055)      (0.021)
Family Income < $7000         -0.300       -0.045       -0.484       -0.181
                              (0.078)      (0.012)      (0.080)      (0.030)
Family Income                 0.121        0.018        0.408        0.153
$7000-$12,000                 (0.073)      (0.011)      (0.063)      (0.024)
Family Income                 0.035        0.005        0.319        0.119
$12,000-$16,000               (0.072)      (0.011)      (0.056)      (0.021)
Family Income                 0.000        0.000        -0.283       -0.106
$16,000-$20,000               (0.070)      (0.010)      (0.055)      (0.020)
Family Income                 0.035        0.005        0.196        0.073
$20,000-$25,000               (0.072)      (0.011)      (0.055)      (0.021)
Family Income                 0.037        0.006        -0.025       -0.009
$25,000-$38,000               (0.077)      (0.012)      (0.057)      (0.021)
Parent Education Missing      -0.730       -0.110       0.916        0.342
                              (0.061)      (0.013)      (0.052)      (0.020)
Parent High School Dropout    0.522        0.078        -0.855       -0.320
                              (0.058)      (0.011)      (0.043)      (0.017)
Parent High School Graduate   0.375        0.056        0.602        0.225
                              (0.060)      (0.011)      (0.044)      (0.015)
Parent Some College           -0.204       -0.031       -0.290       -0.108
                              (0.062)      (0.010)      (0.042)      (0.016)
Single Mother                 -0.255       -0.038       -0.060       -0.023
                              (0.041)      (0.007)      (0.042)      (0.016)
Single Father                 -0.421       -0.063       -0.269       -0.101
                              (0.063)      (0.010)      (0.069)      (0.026)
Natural Mother/Stepfather     -0.286       -0.043       0.263        0.098
                              (0.056)      (0.009)      (0.060)      (0.023)
Other Family Structure        0.155        0.023        -0.060       -0.023
                              (0.048)      (0.007)      (0.053)      (0.020)
Age 16                        0.611        0.092        0.655        0.245
                              (0.089)      (0.015)      (0.115)      (0.043)
Age 17                        1.025        0.154        0.718        0.268
                              (0.050)      (0.014)      (0.087)      (0.033)
Age 18                        0.699        0.105        0.603        0.225
                              (0.050)      (0.012)      (0.088)      (0.033)
Attends Religious Services    0.321        0.048        0.299        0.112
Regularly                     (0.035)      (0.006)      (0.035)      (0.014)
Attends Religious Services    0.082        0.012        0.115        0.043
Occasionally                  (0.039)      (0.006)      (0.041)      (0.015)
Intercept                     0.388                     -0.683
                              (0.093)                   (0.107)
Average treatment effect of   0.130                     0.132
Catholic School               (0.007)                   (0.011)
Log Likelihood                -5155.26                  -3297.87

Asymptotic standard errors are in parentheses. The number of observations in the HIGH 
SCHOOL GRADUATE and COLLEGE ENTRANT models is 13,294 and 10,983, respectively. 
a. Other exogenous variables include dummy variables for urban and rural schools, plus three 
regional dummy variables. 

b. Marginal effects are calculated for a seventeen-year-old white female, living with both
natural parents where at least one parent has a high school degree and family income is 
between $16,000 and $20,000, attends church regularly, and lives in a suburban area in the 
south. 
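
To make the distinction between a marginal effect and an average treatment effect concrete, here is a minimal Python sketch. It rests on two assumptions that are not stated in the table itself: that the marginal effect is the derivative-style quantity phi(index) times the coefficient, evaluated at the reference student of note b, and that the average treatment effect is the sample mean of Phi(index + delta) - Phi(index). The function names and the toy index values are hypothetical. The reported marginal effects are close to a fixed multiple of the coefficients within each model, which is what the first assumption would produce.

```python
import math


def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def marginal_effect(index, coef):
    """Derivative-style marginal effect at a reference index: phi(index) * coef."""
    return norm_pdf(index) * coef


def average_treatment_effect(indices, delta):
    """Sample average of Phi(index_i + delta) - Phi(index_i).

    `indices` holds each student's probit index with CATHOLIC SCHOOL set to 0.
    """
    gains = [norm_cdf(v + delta) - norm_cdf(v) for v in indices]
    return sum(gains) / len(gains)


# Ratio of marginal effect to coefficient for CATHOLIC SCHOOL, from the table.
hs_scale = 0.117 / 0.777       # HIGH SCHOOL GRADUATE column
college_scale = 0.144 / 0.384  # COLLEGE ENTRANT column

# Hypothetical indices for six students; these are NOT values from HS&B.
toy_indices = [-0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
toy_ate = average_treatment_effect(toy_indices, 0.777)

print(round(hs_scale, 3), round(college_scale, 3), round(toy_ate, 3))
```

The two scale factors play the role of the normal density at the reference student's index. The average treatment effect instead averages a discrete change in probability over the whole sample, so the two numbers need not coincide.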


The other results in Table III are consistent with the literature in this 
field. Females, students from wealthier families, students with better ed- 
ucated parents, and students living with both natural parents are all more 
likely to graduate from high school and enter college. Students who are at 
least eighteen are far more likely to drop out of high school, largely because 
these students are more likely to have repeated a grade, a clear signal that they 
have struggled in school. The results on student age may also reflect, in part, 
the fact that compulsory education laws are not binding for older students 
[Angrist and Krueger 1991]. The effect of family income on high school
graduation is large for students from families with incomes below $12,000
(conditional on parents’ education), but increases in income beyond $12,000 
seem to have little additional impact on the chances that a student will grad- 
uate. In contrast, the probability of college entrance increases monotonically 
as income rises. The results also show that although in the raw data blacks 
and Hispanics drop out at higher rates than do whites, once we control for 
observed characteristics these groups are actually more likely to finish high
school. 


B. Potential Omitted Variables Bias 


In this section we ask whether our basic results are robust. Our primary 
concern here is that we have omitted important (measurable) characteris- 
tics of the student that are correlated with the Catholic school variable and 
that, as a consequence, we have overstated the benefits of a Catholic school 
education. The results of some of these sensitivity tests are shown in 
Table IV. We reproduce the basic results from Table III in the first line of
Table IV. 

We begin by asking whether including measures of student ability or 
achievement would change our basic finding. While we would certainly ex- 
pect to find that better students are more likely to finish high school and 
start college, we are hesitant to include measures of ability or achievement 
in our basic model since they are potentially endogenous variables. Here 
we set these concerns aside for the moment and include in line (2) the stu- 
dent’s sophomore score on the HS&B exams in the basic probit models. Not 
surprisingly, test score is an excellent predictor of both measures of educa- 
tional attainment. The t-statistic on the test score variable is over 13 in both 
models. Including test score reduces the average treatment effect of Catholic 
schools from 13.0 percentage points in the dropout model to 10.0 and from 
13.2 to 11.1 in the college model. While the effect of Catholic schools is 
still large in the second line of Table IV, we would argue that these models 
probably understate the true effect of Catholic schools. The sophomore test 
score is missing for over 1900 students. It is more likely to be missing for 
public school students and for students with the highest ex post probability 
of dropping out.¹² Excluding these observations from the data set would then
drag the Catholic school coefficient downward. To illustrate this point more 
clearly, in line (3) we set the test score equal to zero if the score is missing 
and include a dummy that equals 1 if the score is missing but equals 0 other- 
wise. In this specification, including test scores has little impact on our basic 
conclusions. The average treatment effects in line (3) are very close to the 
average treatment effects in line (1).¹³
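
The missing-data device used in line (3) can be sketched in a few lines of Python. This is an illustration only: the function name and the sample scores are hypothetical, and `None` stands in for a missing test score.

```python
def add_missing_indicator(scores):
    """Return (filled, missing) for a list of scores that may contain None.

    filled:  the score, or 0.0 where the score is missing
    missing: 1 where the score is missing, 0 otherwise
    """
    filled = [0.0 if s is None else float(s) for s in scores]
    missing = [1 if s is None else 0 for s in scores]
    return filled, missing


# Hypothetical sophomore test scores; None marks a missing value.
scores = [52.0, None, 47.5, None, 61.0]
filled, missing = add_missing_indicator(scores)
print(filled)
print(missing)
```

Both columns enter the probit together. The dummy absorbs the average difference between students with and without a score, so no observation has to be dropped and the zero fill value carries no meaning of its own.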

We noted above that Catholic school students are more likely to come 
from two-parent, high income, well-educated families; i.e., they have “bet- 
ter” observed characteristics. Moreover, they attend schools with peers who, 
on average, also have better observed characteristics. A number of authors 
have found that a range of social outcomes is correlated with the quality 
of the peer group.¹⁴ Therefore, it is possible that we have overstated the effect
of Catholic schools by ignoring peer group effects. We have calculated
a set of seven peer group measures for each school in our sample using data 
from all students in the first wave of HS&B (and thus in many cases these 
peer group measures are based on 72 students). Our peer group measures 
equal the proportion of students in a school whose parents fall into four education
categories and whose family falls into three income categories.¹⁵ In
line (4) of Table IV we include these peer group measures in our basic probit. 
Although a number of the peer variables are statistically significant and indicate
that better peer groups do increase the probability of completing high
school and entering college, the coefficients on the CATHOLIC SCHOOL
dummy variable and the average treatment effects change very little.¹⁶

A number of previous studies have found that measures of the family’s 
inputs to education are important determinants of a student’s score on stan- 
dardized exams [Coleman, Hoffer, and Kilgore 1982; Coleman and Hoffer 
1987; Noell 1982]. Coleman, for example, includes indicators for whether 
the student’s family owns a calculator, an encyclopedia, more than 50 books, 
or a typewriter. As the results in line (5) indicate, including these vari- 
ables does reduce the impact of a Catholic school education, but the Catholic 
school effect remains quite large. However, as with the test score data, there 
are many missing observations for these variables. Letting the indicator vari- 
ables equal zero if the value is missing and including four dummy variables 
that equal one if the variable is missing, we see in line (6) that these four 
family measures have little impact on the average treatment effect.¹⁷

Given the variation in state labor market conditions, compulsory school- 
ing laws and state support for higher education, it is possible that there are 
strong state effects in the models we have estimated. If these state effects are 
correlated with the probability of attending a Catholic high school, they may 
have led us to overstate the impact of a Catholic education on educational 
attainment. HS&B does not identify the state in which a student lives. We 
can, however, identify all of the students who live in the same state (although 
we do not know which state that is). The Local Labor Market Indicators 
for HS&B (1980-1982) supplemental file reports local labor market statis- 
tics at the county, MSA, and state level for the years 1980-1982. There are 
51 unique values for the product of all state level unemployment rates for the 
three years. In line (7) of Table IV we include 50 state dummy variables in 
the basic probit models. The marginal and average treatment effects in this 
fixed-effects model are very similar to the estimates in line (1).¹⁸

Finally, we run one large model that includes the test scores and a 
dummy for missing test scores, the seven peer group measures, the four 
measures of home inputs into education and indicators for missing values,
and 50 state dummy variables. Including all 67 of these variables decreases 
the average treatment effect of a Catholic education on high school com- 
pletion and college entrance by 8 and 17 percent, respectively. For both 
dependent variables, however, the average treatment effect is still more than 
ten percentage points. Our results, therefore, appear to be robust to rather
different model specifications.


C. Catholic School Selectivity 


Public schools must accept virtually all students who live within their atten- 
dance boundaries, and in general it is very difficult for most public schools 
to expel a student. Catholic schools, on the other hand, are free to select their 
students and to expel students because of poor behavior or poor academic 
performance. Thus, part of the Catholic school effect we have found could 
be due to the way Catholic schools choose their students. They are in a better 
position than public schools to avoid students who in the end are likely to 
drop out.¹⁹

The bivariate probit models we present in the next section of the paper 
can address this question. But we can also present some evidence on this 
point within our single-equation framework. HS&B asked school officials 
whether their schools used entrance exams as part of the admissions pro- 
cess and whether there was a waiting list for the school. If school selection 
does play an important role in explaining the success of Catholic schools, 
then we would expect Catholic schools that use entrance exams or that have 
waiting lists to have lower dropout rates than other Catholic schools. To test 
this hypothesis, we interacted the Catholic school dummy variable with these 
school characteristics. The results are presented in Table V. In both instances 
we do not find a pattern that is consistent with the school selection hypoth- 
esis. In all of the models in Table V, we are unable to reject the hypothesis 
that there is no difference in graduation or college entrance rates across types 
of Catholic schools. 


D. Definition of the Dependent Variables 


As a final sensitivity test in this section, we asked whether our results are 
robust to alternative definitions of the dependent variables. We have rees- 
timated our models allowing for more inclusive measures of high school 
graduation and college completion. For example, we have estimated mod- 
els where we count those with GED’s and those who received diplomas after 
February of 1984 as high school graduates. Counting these students as high 
school graduates increases the sample average graduation rate to 90.4 percent 
and decreases the Catholic school average treatment effect to eight percentage
points. Given the recent work of Cameron and Heckman [1993], who
found that students earning a GED have poorer labor market outcomes than 
regular high school graduates, it is not clear that equating these two groups 
is appropriate. We also counted those who entered two-year colleges and 
those entering any college after February of 1984 as college entrants. This 
change in definition increases the mean of the dependent variable to 60 per- 
cent, but the Catholic school average treatment effect remains roughly twelve 
percentage points.²⁰


IV. Testing for Selectivity Bias 


All of the single-equation models we presented in the previous section treat 
the decision to attend Catholic schools as exogenous. As Goldberger and 
Cain [1982] and others argue (and Coleman acknowledges), selectivity bias 
is potentially the most serious problem in the literature on the effectiveness 
of private schools. The following example illustrates the nature of the error 
that could arise. Consider a child whose parents care a great deal about his 
welfare. We would expect this child to do well in school for two reasons. 
First, his parents will see that he attends a better than expected school and will 
be more willing to pay the cost of sending him to a private school. Second, 
he will succeed in part because of factors that cannot be observed but are 
under his parents’ control. They will spend more time reading to him, they 
will stress the importance of good grades, and they will see that he does his 
homework. A single-equation model would mistakenly attribute all of this 
child’s success to his private school. More formally, our results would be 
biased because the school choice variable in the high school completion and 
college entrance equations would be correlated with the error term. Similar 
problems will arise if Catholic schools are able to screen potential students 
on factors such as a personal interview or they expel students on the basis of 
poor behavior and academic performance. 


A. A Bivariate Probit Model 


In this section we outline a simple bivariate probit model that allows for these 
possibilities. Following the latent variable model in equation (1), suppose 
that the net benefits of attending Catholic school $C_i^*$ can be written as

$$C_i^* = Z_i\gamma + \mu_i, \tag{3}$$

where $Z_i$ is a vector of observables and $\mu_i$ is a random error. A family
will enroll a child in a Catholic school if the net benefits are positive; i.e., if
$C_i^* > 0$. To allow for the possibility that the unobserved determinants of a
student’s performance and the unobserved determinants of a family’s decision
to enroll their teenager in a Catholic school are correlated, we assume
that $\varepsilon_i$ and $\mu_i$ are distributed bivariate normal, with
$E[\varepsilon_i] = E[\mu_i] = 0$, $\mathrm{var}[\varepsilon_i] = \mathrm{var}[\mu_i] = 1$,
and $\mathrm{cov}[\varepsilon_i, \mu_i] = \rho$. Because both decisions we
model are dichotomous, there are four possible states of the world ($Y_i = 0$
or 1 and $C_i = 0$ or 1). The likelihood function corresponding to this set of
events is therefore a bivariate probit.
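
For reference, the four cell probabilities can be written compactly. The display below is a sketch under stated assumptions rather than a formula taken from the paper: it assumes the outcome equation has latent form $Y_i^* = X_i\beta + C_i\delta + \varepsilon_i$ (in line with the linear version in equation (4)), writes $\rho$ for the correlation between the errors of the two equations, and introduces the sign variables $q_{Y}$, $q_{C}$ and the symbol $\Phi_2$ purely for illustration.

$$
\Pr(Y_i = y,\; C_i = c) \;=\; \Phi_2\bigl(q_{Y}\,(X_i\beta + c\,\delta),\;\; q_{C}\,Z_i\gamma;\;\; q_{Y}\,q_{C}\,\rho\bigr),
\qquad q_{Y} = 2y - 1,\quad q_{C} = 2c - 1,
$$

where $\Phi_2(a, b; r)$ is the standard bivariate normal distribution function with correlation $r$. For example, with $y = c = 1$ the probability is $\Phi_2(X_i\beta + \delta,\, Z_i\gamma;\, \rho)$, and with $y = 0$, $c = 1$ it is $\Phi_2(-(X_i\beta + \delta),\, Z_i\gamma;\, -\rho)$. The log likelihood is the sum over students of the log of the probability of the observed cell.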

This system is identified if at least one variable in $Z_i$ is not contained
in $X_i$. Initially, we use as our instrument a dummy variable CATHOLIC
RELIGION that equals 1 if the student reports that she is Catholic and 0 oth- 
erwise. Subsequently, we consider alternative instruments such as whether 
a student attends school in a predominantly Catholic area and a set of in- 
struments that we form by interacting CATHOLIC RELIGION with religious 
attendance variables. We look at the validity of these variables as instruments 
below. 

The bivariate probit results are summarized in Table VI. We repeat the 
basic single-equation results from Table III in lines (1) and (6) of Table VI. 
In lines (2) and (7) we present the maximum likelihood (MLE) bivariate pro- 
bit estimates using CATHOLIC RELIGION as an instrument and the same 
right-hand variables we use in the basic single-equation models. In both the 
high school graduate and college entrance models, the MLE estimates of 
the marginal effect of Catholic schools and the average treatment effect are 
quite close to the single-equation estimates. The MLE estimate of the correlation
coefficient $\rho$ is negative in the high school completion model and
positive in the college model, but in both cases the estimate is small, imprecise,
and thus statistically insignificant.

In the remainder of Table VI we look at the impact of adding state effects 
and tenth grade test scores (variables that appeared to be important when we 
looked at them in Table IV) to the bivariate probit model. These additional 
variables have little impact on our basic conclusions in the dropout model. 
The estimated average treatment effect in lines (3)-(5) is similar to the average
treatment effect in (2). Our estimates of $\rho$ are always statistically insignificant.
Adding tenth grade test scores to the college models (regardless
of whether we include state effects as well) reduces the average treatment effect
and leads to an estimate of $\rho$ which is positive and significantly different
from zero. Even in these models, however, attending a Catholic high school 
increases the probability of entering college by more than seven percentage 
points. 

The last column of Table VI presents estimates of a somewhat different 
econometric model. Although the bivariate probit model is straightforward 
to estimate, the model is substantially more complicated than a standard 
Table VI. Maximum Likelihood Estimates of HIGH SCHOOL GRADUATE
and COLLEGE ENTRANT Bivariate Probit Model Using CATHOLIC
RELIGION as an Instrument

                                            MLE estimates of bivariate probit model
                                            Coefficient  Marginal     Average      ρ            2SLS estimate
                                            on CATHOLIC  effect^c     treatment                 of coefficient on
Model  Other variables^b in X_i             SCHOOL                    effect                    CATHOLIC SCHOOL

High School Graduate^a
(1)                                          0.777        0.117        0.130                     0.096^d
                                            (0.056)      (0.014)      (0.007)                   (0.008)
(2)                                          0.859        0.133        0.141       -0.053        0.127
                                            (0.115)      (0.022)      (0.014)      (0.067)      (0.024)
(3)  10th Grade Test Score and               0.678        0.078        0.114        0.028        0.103
     Test Missing                           (0.126)      (0.018)      (0.017)      (0.072)      (0.024)
(4)  State Effects                           0.911        0.142        0.144       -0.050        0.114
                                            (0.121)      (0.027)      (0.015)      (0.072)      (0.024)
(5)  10th Grade Test Score, Test Missing,    0.746        0.124        0.121        0.025        0.134
     and State Effects                      (0.132)      (0.028)      (0.016)      (0.077)      (0.030)

College Entrant^a
(6)                                          0.384        0.144        0.132                     0.137^d
                                            (0.032)      (0.012)      (0.011)                   (0.011)
(7)                                          0.288        0.109        0.098        0.067        0.148
                                            (0.079)      (0.033)      (0.028)      (0.049)      (0.030)
(8)  10th Grade Test Score and               0.211        0.078        0.064        0.124        0.098
     Test Missing                           (0.083)      (0.034)      (0.026)      (0.052)      (0.024)
(9)  State Effects                           0.341        0.110        0.115        0.056        0.092
                                            (0.084)      (0.032)      (0.029)      (0.053)      (0.024)
(10) 10th Grade Test Score, Test Missing,    0.277        0.071        0.082        0.113        0.098
     and State Effects                      (0.090)      (0.026)      (0.027)      (0.046)      (0.028)

Asymptotic standard errors are in parentheses.

a. Models (1) and (6) are single-equation estimates from Table III. To estimate models (4), (5), (9),
and (10), we deleted all states with no Catholic school students. The high school completion and
college entrance models contain 10,120 and 8470 observations, respectively. Both models contain
data from twenty states. Models (1), (2), and (3) contain 13,294 observations, and models (6), (7),
and (8) contain 10,983 observations.

b. Other exogenous variables include those listed in Table III.

c. Marginal effects are calculated for the individual defined in Table III.

d. Estimated CATHOLIC SCHOOL coefficient from a linear probability model.


two-stage least squares (2SLS) model one could estimate if all potentially 
endogenous variables were continuous. Fortunately, Angrist [1991] has 
shown that instrumental variable estimation is a viable alternative to the bi- 
variate probit model. In the notation of equation (1) Angrist showed in a 
Monte Carlo study that if we ignore the fact that the dependent variable is
dichotomous and estimate

$$Y_i = X_i\beta + C_i\delta + \varepsilon_i \tag{4}$$

with instrumental variables (IV), the IV estimate of $\delta$ is very close to the
estimated average treatment effects calculated in a bivariate probit model. A
comparison of the third and fifth columns of Table VI illustrates the Angrist
result. The 2SLS estimates of the Catholic school effect and the average 
treatment effect are very similar in all of the models we have presented in 
that table. We will take advantage of this result below where we focus on the 
validity of our instruments. 


B. The Validity of the Instruments 


If CATHOLIC RELIGION is a valid instrument, then (i) it must be a deter- 
minant of the decision to attend a Catholic School, but (ii) it must not be a 
determinant of the decision to drop out of high school or to start college; i.e., 
it must not be correlated with the error term $\varepsilon_i$. Not surprisingly, it is easy to
show that it meets the first test. In a probit model that explains the probabil- 
ity a student will attend a Catholic school, the t-statistic on the CATHOLIC 
RELIGION variable is 36.3. In a simple OLS model where CATHOLIC 
SCHOOL is regressed on CATHOLIC RELIGION, the $R^2$ is 0.16.

Thus, the credibility of our bivariate probit results turns on our as- 
sumption that high school students who are Catholic are no more likely to 
graduate from high school or to begin college than otherwise identical stu- 
dents who are not Catholic. There is little evidence from other studies that 
would suggest that there are important differences in the education levels of 
Catholics and non-Catholics. Taubman [1975, Table 3, p. 179], for example, 
found that the level of education of Jews and Protestants was not significantly 
different from the level of education of Catholics. Using the data appendix 
in Tomes [1984], we find that Catholics and non-Catholics have virtually the 
same average years of education (12.88 versus 12.64, respectively). How- 
ever, in the raw HS&B data (that is, without accounting for variables that are 
correlated with the Catholic religion variable), Catholic students are more 
likely to finish high school and to go to college. In the full sample, 88.4 per- 
cent of Catholics graduated from high school as compared with 79.0 percent 
of non-Catholics. Among students who finished high school, 42.8 percent of 
Catholics entered college as compared with 33.5 percent of non-Catholics. 
These differences could lead us to an estimate of the effect of a Catholic school
education that is large but possibly misleading.

The following simple calculation makes this point clear. With our dis- 
crete instrument and assuming a bivariate linear model where the only 
right-hand-side variable is CATHOLIC SCHOOL, we can generate an instrumental
variable estimate for the CATHOLIC SCHOOL effect through a comparison
of means. Using the results in Wald [1940], the instrumental variable
estimate is simply the difference in graduation rates for Catholics and
non-Catholics, divided by the difference in the probability that Catholics and
non-Catholics attend Catholic high schools. In the full sample, 39.1 percent
of Catholics and 6.4 percent of non-Catholics go to Catholic schools. Thus,
the Wald instrumental variable estimate for the impact of Catholic schools in
the dropout model is (.884 - .790)/(.392 - .064) = .287. For the sample
that has completed high school, 43.1 percent of Catholics and 7.8 percent of
non-Catholics are in Catholic high schools, implying a Wald estimate for the
college entrance model of (.428 - .335)/(.431 - .078) = .263.
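
The two ratios above are easy to reproduce. The short Python sketch below uses the figures exactly as they appear in the displayed ratios; the function name and argument names are hypothetical.

```python
def wald_estimate(outcome_z1, outcome_z0, treated_z1, treated_z0):
    """Wald IV estimate with a binary instrument.

    Difference in mean outcomes between the two instrument groups,
    divided by the difference in their treatment rates.
    """
    gap_in_treatment = treated_z1 - treated_z0
    if gap_in_treatment == 0:
        raise ValueError("instrument does not shift treatment; ratio undefined")
    return (outcome_z1 - outcome_z0) / gap_in_treatment


# Instrument: Catholic (z = 1) versus non-Catholic (z = 0).
graduation = wald_estimate(0.884, 0.790, 0.392, 0.064)
college = wald_estimate(0.428, 0.335, 0.431, 0.078)

print(round(graduation, 3), round(college, 3))
```

The denominator is the first-stage gap in Catholic school attendance. Because that gap is only about a third, any raw difference in outcomes between Catholics and non-Catholics is scaled up roughly threefold in the estimate.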

These raw numbers suggest that, on average, Catholics are better ed- 
ucated than non-Catholics. This will pose a problem for our estimation if, 
after controlling for other observed characteristics, the Catholic religion in- 
strument is correlated with a student’s unobserved propensity to graduate 
from high school or enter college. The most straightforward way to address 
this issue is to include CATHOLIC RELIGION in the single-equation probits 
we discussed in Table III. We recognize that this is not a formal test since
if the correct specification is a bivariate probit then single-equation models 
are misspecified, but it does offer a clear sense of the patterns in the data. If 
we include CATHOLIC RELIGION in a single-equation dropout model, its 
estimated coefficient is positive but statistically insignificant. The estimated 
marginal effect of the CATHOLIC RELIGION variable in that model is very 
small compared with the effect of going to a Catholic school. Although this 
is not a direct test of whether our instrument is valid, it does indicate that, as 
a group, Catholics are no different from non-Catholics. 

We performed three further tests in order to explore this issue. First, we 
have constructed additional sets of instruments that recognize that there is 
heterogeneity in the demand for Catholic schools among Catholics. These 
models, for example, allow for the possibility that Catholics who attend 
church regularly are more likely to send their children to Catholic schools 
than are Catholics who rarely go to church. Second, following Neal [1994] 
and Hoxby [1994], we have used a very different instrument: the propor- 
tion of the population in the county where a student attends school that is 
Catholic.²¹ They argue that it is probable that there will be more Catholic
schools in predominantly Catholic areas and thus students (given their observable
characteristics) who live in such areas are more likely to attend a
Catholic school.²² There is no reason, however, to suspect that the probability
that a student will finish high school or start college depends on her
neighbors’ religion. Third, we have formed a final set of instruments by
combining the Catholic religion and Catholic population variables. The mod- 
els, like the models that incorporate church attendance, allow for heterogene- 
ity among Catholics (e.g., Catholics who live in heavily Catholic neighbor- 
hoods are more likely to send their children to Catholic schools). 

This research strategy is particularly attractive since it leads to several 
models that are overidentified. In those models, we can use Newey’s [1985] 
method of moments specification tests to look at the internal consistency of 
the model; i.e., whether the variables we use as instruments can be excluded 
from the structural equation. In a 2SLS model the test statistic is constructed 
by regressing the estimated errors from the structural model of interest on 
all exogenous variables in the system. The number of observations times the 
uncentered $R^2$ from this synthetic regression is distributed as $\chi^2$ with degrees
of freedom equal to the number of instruments minus the number of endogenous
right-hand-side variables in the structural equation of interest. Here again, we
recognize that this is not a proper formal test. Although the Angrist [1991] 
result allows us to accurately estimate the average treatment effect via 2SLS, 
it is not clear that the assumptions necessary to perform the tests of overi- 
dentifying restrictions are met when both Y and C are discrete. This class of 
tests, however, is the best available diagnostic. 

Table VII summarizes the estimates of models that rely on these alter- 
native instruments. All of the models include the exogenous variables that 
we included in the basic versions of our probits presented in Table III. In
lines (1) and (7) we repeat the estimates of the Catholic school effect from 
lines (2) and (7) in Table VI. For the HIGH SCHOOL GRADUATE mod- 
els, we first interact Catholic religion with the religious attendance variables. 
Next, we use % CATHOLIC IN COUNTY as an instrument. We next use both 
CATHOLIC RELIGION and % CATHOLIC IN COUNTY as instruments, and 
then add the interaction of these variables to the previous model. Finally, 
in line (6) we use % CATHOLIC IN COUNTY as our instrument and include 
CATHOLIC RELIGION as a covariate in both the Catholic school and dropout 
equations. 

Our estimates of the Catholic school effect from the bivariate probit 
models in lines (1)-(5) fall between 0.114 and 0.141. The 2SLS estimates 
are quite similar to the bivariate probit estimates in all cases. We cannot 
construct a test of overidentifying restrictions for the models in lines (1) and 
(3) since those models are exactly identified. For the other three models, 
however, all test statistics are well below their 95 percent critical value. The 
2SLS estimate of the Catholic school effect in line (6) is consistent with 
our other estimates, though this effect is measured imprecisely (the standard 
Table VII. System Estimates of HIGH SCHOOL GRADUATE and
COLLEGE ENTRANT Models with Alternative Instruments

                                                    Bivariate probit        2SLS estimate       Test of overidentifying
                                                    estimate of average     of CATHOLIC         restrictions (d.o.f.)
Instruments                                         treatment effect,       SCHOOL              [95% critical value]
                                                    CATHOLIC SCHOOL

High School Graduate^a
(1)  Catholic Religion                               0.141                   0.127
                                                    (0.014)                 (0.024)
(2)  Catholic Religion × Attendance at               0.141                   0.107               3.29 (2)
     Religious Services                             (0.013)                 (0.022)             [5.99]
(3)  % Catholic in County                            0.114                   0.130
                                                    (0.033)                 (0.076)
(4)  Catholic Religion and                           0.139                   0.127               0.10 (1)
     % Catholic in County                           (0.044)                 (0.024)             [3.84]
(5)  Catholic Religion, % Catholic in County         0.137                   0.127               0.84 (2)
     and Catholic Religion × % Catholic in County   (0.014)                 (0.024)             [5.99]
(6)  % Catholic in County^b                          0.061                   0.144
                                                    (0.038)                 (0.373)

College Entrant^a
(7)  Catholic Religion                               0.098                   0.148
                                                    (0.028)                 (0.030)
(8)  Catholic Religion × Attendance at               0.122                   0.167               6.3 (2)
     Religious Services                             (0.127)                 (0.027)             [5.99]
(9)  % Catholic in County                            0.240                   0.656
                                                    (0.053)                 (0.093)
(10) Catholic Religion and                           0.115                   0.161               33.7 (1)
     % Catholic in County                           (0.037)                 (0.029)             [3.84]
(11) Catholic Religion and Catholic Religion         0.071                   0.104               0.81 (1)
     × % Catholic in County^c                       (0.028)                 (0.031)             [3.84]

Asymptotic standard errors are in parentheses. The number of observations in the HIGH
SCHOOL GRADUATE and COLLEGE ENTRANT models is 13,294 and 10,983, respectively.
a. Other exogenous variables include those listed in Table III.

b. CATHOLIC RELIGION is included as an exogenous variable in the model.

c. % CATHOLIC IN COUNTY is included as an exogenous variable in the model.


error is more than ten times as large as the standard errors in most of the 
first five models). The bivariate probit estimate of model (6) is somewhat 
smaller than the other estimates in the upper panel of Table VII. It thus ap- 
pears that our graduation results are fairly robust, though the results where 
we depend on CATHOLIC RELIGION as an instrument are estimated more 
precisely. 
The COLLEGE ENTRANT models in lines (7) through (10) parallel
the graduation models in lines (1) through (4). The COLLEGE ENTRANT 
models are much more sensitive to the choice of instruments than are the 
HIGH SCHOOL GRADUATE models. In particular, versions of the model 
that use % CATHOLIC IN COUNTY as an instrument sometimes lead to re- 
sults that are substantially different from the results we reported earlier. For 
example, in line (9) where we use % CATHOLIC IN COUNTY as the sin- 
gle instrument, the 2SLS estimate of CATHOLIC SCHOOL is implausibly 
large. The test statistic for overidentifying restrictions in the college model where
we interact CATHOLIC RELIGION with the religious attendance variables is
slightly larger than the critical value (the p-value is approximately 0.043), but
the college model in line (10) clearly rejects the null hypothesis of internal 
consistency. 
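
The p-value quoted above, and the 95 percent critical values in Table VII, can be checked with a few lines of Python. This is a sketch that covers only one or two degrees of freedom, where the chi-squared tail probability has a simple closed form; the function name is hypothetical.

```python
import math


def chi2_upper_tail(x, df):
    """P(chi-squared with df degrees of freedom exceeds x), for df = 1 or 2."""
    if x < 0:
        raise ValueError("statistic must be non-negative")
    if df == 1:
        # A chi-squared(1) variable is a squared standard normal.
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        # A chi-squared(2) variable is exponential with mean 2.
        return math.exp(-x / 2.0)
    raise ValueError("this sketch handles only df = 1 or df = 2")


# Test statistics and degrees of freedom as reported in Table VII.
for label, stat, df in [("line (2)", 3.29, 2), ("line (4)", 0.10, 1),
                        ("line (8)", 6.3, 2), ("line (10)", 33.7, 1)]:
    print(label, stat, df, chi2_upper_tail(stat, df))
```

A tail probability below 0.05 corresponds to a statistic above the bracketed critical value in the table.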

We suspect that the problem is that Catholics are likely to live in states 
where large numbers of students go on to college. To test this hypothesis, 
we used the data files from the 1980-1982 October Current Population Sur- 
veys and calculated state-level averages of the percent of 18 to 22 year-olds 
who are enrolled in college. The raw correlation between these values and 
the percent of the population in a state that is Catholic is 0.38 (p-value of 
0.006). Because % CATHOLIC IN COUNTY may be capturing some unobserved 
state characteristics in the college models, in line (11) we included it 
as an exogenous variable and used CATHOLIC RELIGION and the interaction 
of CATHOLIC RELIGION and % CATHOLIC IN COUNTY as instruments. In 
that model the estimated average treatment effect is 10.4 percent, and the 
statistic required for the test of overidentifying restrictions is well below the 
95 percent critical value. 
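
The logic of an overidentification test can be sketched in a few lines. The example below is a minimal illustration on synthetic data, assuming a linear two-stage least squares model rather than the bivariate probit used in the paper; the function name, the variable names, and the simulated coefficients are all hypothetical. It computes the Sargan statistic (sample size times the R-squared from regressing the 2SLS residuals on all instruments) and compares it with the 95 percent chi-square critical value for one overidentifying restriction.

```python
import numpy as np

def two_stage_least_squares(y, X_exog, d_endog, Z_excl):
    """2SLS with one endogenous regressor; returns (coefficients, Sargan statistic)."""
    n = len(y)
    W = np.column_stack([X_exog, Z_excl])   # full instrument set
    X = np.column_stack([X_exog, d_endog])  # regressors in the outcome equation
    # First stage: project the regressors onto the instrument set.
    X_hat = W @ np.linalg.lstsq(W, X, rcond=None)[0]
    # Second stage: regress the outcome on the fitted regressors.
    beta = np.linalg.lstsq(X_hat, y, rcond=None)[0]
    resid = y - X @ beta                    # residuals use the actual regressors
    # Sargan statistic: n * R^2 from regressing the residuals on all instruments.
    fitted = W @ np.linalg.lstsq(W, resid, rcond=None)[0]
    r2 = np.sum((fitted - resid.mean()) ** 2) / np.sum((resid - resid.mean()) ** 2)
    return beta, n * r2

# Synthetic data (an assumption for illustration): two valid instruments,
# one endogenous regressor d, and an unobserved confounder u.
rng = np.random.default_rng(0)
n = 5000
z1 = rng.normal(size=n)
z2 = rng.normal(size=n)
u = rng.normal(size=n)
d = 0.8 * z1 + 0.5 * z2 + u + rng.normal(size=n)
y = 1.0 + 0.5 * d + u + rng.normal(size=n)

const = np.ones((n, 1))
beta, sargan = two_stage_least_squares(y, const, d, np.column_stack([z1, z2]))

CHI2_95_DF1 = 3.841  # 95 percent critical value, chi-square with 1 degree of freedom
print("2SLS coefficient on d:", round(float(beta[1]), 3))
print("Sargan statistic:", round(float(sargan), 3), "reject:", sargan > CHI2_95_DF1)
```

With two excluded instruments and one endogenous regressor there is one overidentifying restriction, which is why the statistic is compared with a chi-square distribution with one degree of freedom. A statistic above the critical value signals that the instruments are not mutually consistent, as in the college models discussed above.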


C. Heterogeneity in the Catholic School Effect 


We have also explored the impact of Catholic schools on different subgroups 
of our sample, and thus, for example, we have estimated separate models for 
blacks and whites and Catholics and non-Catholics. When we divide the sam- 
ple into Catholics and non-Catholics, we clearly cannot use CATHOLIC RE- 
LIGION as an instrument and thus must rely on % CATHOLIC IN COUNTY 
to identify those bivariate probit models. As we showed in Table VII, 
% CATHOLIC IN COUNTY led to several implausible results in the college 
models. We therefore focus on high school graduation in this section of the 
paper. 

Table VIII presents estimates of the average treatment effect of a 
Catholic school education for various subgroups. In the single-equation pro- 
bits and bivariate probits where we use CATHOLIC RELIGION as an in- 
strument, Catholic schools have a larger impact on students who have the 
Table VIII. Heterogeneity of the Average Treatment Effect, HIGH SCHOOL 
GRADUATE Models 

                                               Average treatment effect, CATHOLIC SCHOOL (a)

                                                          Bivariate probit estimates
                                                          with instruments:
                                     Mean HIGH Single-
                            Number   SCHOOL    equation   % CATHOLIC  CATHOLIC
Sample                      of obs.  GRADUATE  probit     IN COUNTY   RELIGION

White                       7831     0.826     0.141      0.086       0.128
                                              (0.007)    (0.039)     (0.016)
Black                       1833     0.803     0.134      0.111       0.146
                                              (0.019)    (0.101)     (0.044)
Urban (b)                   3150     0.774     0.172      0.139       0.184
                                              (0.016)    (0.069)     (0.037)
Suburban                    6696     0.862     0.109      0.003       0.120
                                              (0.008)    (0.052)     (0.017)
Sophomore Test,             2842     0.658     0.213      0.113       0.242
  First Quartile                              (0.025)    (0.145)     (0.051)
Sophomore Test,             2842     0.829     0.105      0.128       0.110
  Second Quartile                             (0.016)    (0.066)     (0.087)
Sophomore Test,             2854     0.916     0.069      0.176       0.071
  Third Quartile                              (0.010)    (0.039)     (0.020)
Sophomore Test,             2841     0.960     0.030      0.217       0.012
  Fourth Quartile                             (0.007)    (0.188)     (0.031)
Catholic                    5104     0.884     0.107      0.328
                                              (0.008)    (0.033)
Non-Catholic                8190     0.790     0.145      0.072
                                              (0.013)    (0.098)


Asymptotic standard errors are in parentheses. 

a. Other exogenous variables include those listed in Table III. 

b. Schools in the South were deleted from this subsample because there were no urban Catholic 
schools. 


lowest probability of finishing high school: blacks, students in urban areas, 
and students with low test scores. We still find, however, a large, statistically 
significant Catholic school effect for white and suburban students. These re- 
sults are in contrast to Neal [1994], who found that Catholic schools raise the 
probability that urban black students will graduate but have little impact on 
other groups of students. 

Some of these patterns emerge in bivariate probits where we use 
% CATHOLIC IN COUNTY as an instrument, though in general, these models 
are estimated less precisely. The effect on black and white students is similar, 
but the average treatment effect for blacks is not significantly different from 
zero. The pattern across test score groups is difficult to interpret, and the 
Catholic school effect for Catholics is implausibly large. In all, these results 
and the COLLEGE ENTRANT results in Table VII lead us to conclude that 
while the argument in favor of using % CATHOLIC IN COUNTY to identify 
the bivariate probit models is quite plausible, the actual gains from doing so 
are not as clear as we had first hoped.²³ 
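
The significance statements in this paragraph can be checked directly from the estimates and asymptotic standard errors reported in Table VIII. The sketch below is an illustration only: it copies four rows of the % CATHOLIC IN COUNTY column and assumes the usual two-sided test at the 5 percent level, comparing the ratio of each estimate to its standard error with 1.96. The variable names are hypothetical.

```python
# Estimates and asymptotic standard errors copied from the
# % CATHOLIC IN COUNTY column of Table VIII: (estimate, standard error).
county_iv_estimates = {
    "White": (0.086, 0.039),
    "Black": (0.111, 0.101),
    "Catholic": (0.328, 0.033),
    "Non-Catholic": (0.072, 0.098),
}

CRITICAL_95 = 1.96  # two-sided 5 percent critical value, standard normal (assumed test)

# z-ratio: estimate divided by its asymptotic standard error.
z_ratios = {group: est / se for group, (est, se) in county_iv_estimates.items()}
significant = {group: abs(z) > CRITICAL_95 for group, z in z_ratios.items()}

for group, z in z_ratios.items():
    print(f"{group:13s} z = {z:5.2f}  significant at 5 percent: {significant[group]}")
```

The point estimates for black and white students are of similar size, but the much larger standard error for blacks leaves that estimate well inside two standard errors of zero.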


V. Summary and Conclusions 


Spurred by the work of Coleman et al., academics and policymakers have 
been involved in a decade-long debate over the relative effectiveness of pub- 
lic and private schools. This debate has been waged largely over a single 
outcome measure: standardized test scores. But, as Card and Krueger [1992, 
p. 37] have argued, “success in the labor market is at least as important a 
yardstick for measuring the performance of the educational system as stan- 
dardized tests.” In this paper we have looked at two measures of education 
that are clearly linked to virtually every measure of success in the labor mar- 
ket: the decisions to finish high school and go to college. We find that teens 
enrolled in Catholic schools have a significantly higher probability of com- 
pleting high school and starting college, that the results appear to be robust, 
and that we cannot attribute the differences between sectors to sample selec- 
tion bias. Catholic schools appear to have particularly large effects for ur- 
ban students. This result has some potentially important policy implications 
given the concern over the quality of public schools in many inner cities. 
Most of our conclusions are consistent with other work on this problem in- 
cluding Neal [1994], who uses a different data set but a similar econometric 
approach, and Sander and Krautmann [1995] (which we learned of only after 
finishing the research for this paper), who use the same data set, a somewhat 
different econometric approach, and different instruments. 

Our research leaves open a number of questions. First, it is possible that 
further analysis of the HS&B data or other data will make the Catholic school 
effect go away. For example, perhaps we have missed an important omitted 
variables problem or possibly a different approach to selectivity bias will 
yield different conclusions. Second, if Catholic schools are as effective as 
our results suggest, then we are left with a puzzle: why do not more families 
(particularly lower income Catholic families) make a fairly modest invest- 
ment and send their children to a Catholic school? Third, if Catholic schools 
are more effective than public schools, we need to know more about the 
source of their effectiveness. Coleman et al. attribute this success to Catholic 
schools’ emphasis on discipline, attendance, and homework. Our research 
does not address this issue, but it is an obvious next step. Finally, we need 
to know whether it will ever be possible to apply the lessons we learn from 
the Catholic schools to nonreligious private schools. In some ways, Catholic 
schools are like other private schools—they must meet the test of the mar- 
ket. But in other ways they are obviously fundamentally different, and it 
is not clear that they succeed because of the importance of religion or the 
discipline of competition.²⁴ 
Notes 

1. Henig [1994], for example, found that out of 125 questions in HS&B 
dealing with vocabulary, reading, mathematics, science, writing, and civics, 
public school students improved by 7.16 items (from 67.07 as sophomores 
to 74.23 as seniors), while Catholic school students improved by 8.98 items. 
Thus, even before accounting for differences in family characteristics, 
Coleman’s Catholic school effect represents only 8.98 − 7.17 = 1.81 additional 
correct answers. 

2. For example, Chubb and Moe [1990], in their 318-page analysis of 
effective schools, use test scores as virtually their sole measure of school 
performance. Coleman does discuss differences in dropout rates briefly, but 
the analysis is limited to simple cross tabulations of the data. Neal [1994] 
and Sander and Krautmann [1995] are similar in some ways to this paper. 

3. For a review of the effects of cognitive development on labor market 
performance, see Hanushek, Rivkin, and Jamison [1992] and Bishop [1991]. 

4. The test score we report is the sum of the “formula” score on the 
mathematics, vocabulary, and reading exams. Students received one point 
for each correct answer and lost a fraction of a point for each incorrect an- 
swer (where the fraction depends on the number of possible answers). The 
maximum possible score on the 10TH GRADE TEST SCORE is 68. 

5. All individual and school variables were constructed from either the 
composite variables in the HS&B data set or were taken from the base-year 
survey. The summary statistics in Table I are unweighted and thus do not 
represent an accurate picture of 1980 high school sophomores. We have not 
used sample weights in our econometric work. 

6. The definition of these two outcome measures is not quite as straight- 
forward as one might think. For example, we do not count students earning 
GED’s as high school graduates. This is a reasonable restriction given re- 
cent work by Cameron and Heckman [1993], who find that graduates with 
GED’s do not perform as well in the labor market as students with regular 
high school diplomas. Similarly, we do not count people who went to college 
long after graduating from high school and people who attended a two-year 
college as college students. Restricting our attention to students entering a 
four-year college is arguable given work by Kane and Rouse [1993] who find 
that credit hours from two- and four-year colleges are rewarded equally in 
the workforce. Rouse [1995] also finds that, on net, community colleges in- 
crease total years of schooling but do not alter the probability of obtaining 
an undergraduate degree. As we demonstrate later, these assumptions are not 
critical. 

7. There is reason to believe that most of the missing income values are 
from families with low income. Students were given a breakdown of family 
income by thirds and asked in which portion of the income distribution 
their family fell. Using sample weights from the second follow-up survey, 
29 percent and 27 percent of the students reported being in the top and 
middle thirds of the income distribution, respectively, while only 13 percent 
said that their family was in the bottom third (the rest did not respond). 

8. The test quartiles were calculated for the entire sample using second 
follow-up sample weights. 

9. Bryk, Lee, and Holland [1993] found similar results for Catholic 
schools in their analysis of the HS&B test score data. Using quantile re- 
gression techniques, Evans and Schwab [1993] also found that the benefits 
of a Catholic education on test scores are concentrated among the least able 
students, students whose parents have little education and students from low- 
income families. 

10. We calculated the marginal effects for the “average” public school 
student, whom we defined as a seventeen-year-old white female, living with 
both natural parents, in a family where at least one parent has a high school 
diploma, family income is between $16,000 and $20,000, who attends 
religious services regularly, and who lives in a suburb in the South. 

11. All of the college graduation models we present in this paper are 
estimated on the subsample of students who graduated from high school. 
Within the entire sample, 26 percent of the public school students and 53 per- 
cent of the Catholic school students entered college. The average treatment 
effect in the college model presented in Table III using the entire sample is 
0.217 with a standard error of 0.020. 

12. In our sample, the sophomore test score is missing for 20 percent 
of the public school students and 11 percent of the Catholic school students. 
High school completion rates are 84 percent for students with a valid test 
score, but only 74 percent for students without a score. 

13. The marginal effects are calculated for the reference individual 
defined in Table III. In addition, we assume that this student’s test score equals 
the median public school score in our sample. The marginal effects (standard 
errors) for the 10TH GRADE TEST SCORE in the high school completion 
and college entrance models are 0.004 (0.0002) and 0.013 (0.001), respec- 
tively. These results suggest that a one-standard-deviation increase in the test 
score over the median value (about a fifteen-point increase) would increase 
high school completion and college entrance probabilities by six and twenty 
percentage points, respectively. 

14. See Jencks and Mayer [1990] for a review of the literature on peer 
effects, and see Mayer [1991] for an estimate of the effects of peer groups 
on high school completion rates. Both of these studies are concerned with 
single-equation estimates of the effects of peers on the economic outcomes 
of teens. Evans, Oates, and Schwab [1992] argue that because families can 
choose among schools and neighborhoods, a student’s peer group is a poten- 
tially endogenous variable. We do not consider the endogeneity of the peer 
measures in this paper. 

15. HS&B did collect information at the school level which could be 
used directly to form peer group measures. As with the test score data, 
however, these variables are missing for many schools (especially public 
schools). Although the peer group measures we constructed are based on 
a sample rather than a census of students from a high school, the large num- 
ber of observations per school should provide us with a good approximation 
of the composition of the school. We have tested this argument by using this 
same procedure to construct a measure of the proportion of the students in a 
school who are black and comparing this estimate with the figure reported in 
the school survey. The correlation coefficient for these two series is 0.97. 

16. The marginal effects are calculated for a student who has an aver- 
age public school value of the peer group variables. Because of space lim- 
itations, we do not report the parameter estimates for all seven peer group 
measures in both models. We note that the peer group variables measuring 
parents’ education tended to be more important determinants of high school 
completion and college entrance than measures of income. In fact, once we 
included parents’ education, the peer measures for income became largely 
insignificant. The marginal effects (standard errors) for the peer group 
variables measuring parents’ education in the high school completion model are 
as follows: % PARENT EDUCATION MISSING −0.24 (0.06), % PARENT 
EDUCATION LESS THAN HIGH SCHOOL −0.12 (0.05), % PARENT 
EDUCATION HIGH SCHOOL GRADUATE −0.11 (0.04), % PARENT 
EDUCATION SOME COLLEGE −0.20 (0.05). The corresponding values for the 
college entrance model are −0.63 (0.10), −0.35 (0.07), −0.57 (0.05), −0.50 
(0.08). The reference group in both models is the percent of students in the 
school whose parents are college educated. 

17. To calculate the marginal effects for these two models, we assume 
that the individual owned all four items. 

18. We calculated marginal effects for a student who lived in the state 
with the most observations in our data set. 

19. The evidence from the existing literature on the role of student se- 
lection in the success of Catholic schools is somewhat mixed. Bryk, Lee, 
and Holland [1993] argue that, in general, Catholic schools are not highly 
selective in their admissions. They find that the typical Catholic school ac- 
cepts 88 percent of the students who apply. They also argue that contrary to 
widespread belief, very few students are expelled from Catholic schools for 
either academic or disciplinary grounds. On average, Catholic high schools 
dismiss fewer than two students per year. Witte [1990] presents evidence that 
Catholic schools do in fact screen admissions so that they are able to avoid 
students who are likely to do poorly. For example, he finds that 55.5 percent 
of Catholic school principals, as compared with only 8.4 percent of public 
school principals, indicated that prior academic record was an important fac- 
tor in admission decisions. 

20. These results are available upon request. 

21. The Association of Statisticians of American Religious Bodies 
(ASARB) provided us with data on the Catholic population by county. Their 
data are drawn from a survey of over 200,000 congregations and churches 
with total membership of nearly 115 million. See Quinn et al. [1982] for 
a discussion of these data. With the ASARB data and data from the 1980 
Census, we then constructed an estimate of the percent Catholic at the county 
level. County identifiers are not available in the public use HS&B data. 
We entered into an agreement with the U.S. Department of Education 
under which we created a data set that included the percent Catholic in a county 
and county FIPS codes. The contractor for the HS&B data set then merged 
the data set we created with student identification numbers. In order to pro- 
tect the confidentiality of the data, the percent Catholic in the county vari- 
able was grouped (0.0-4.9 percent, 5.0-9.9 percent, etc.) and top-coded at 
70 percent. 

22. This hypothesis is easily validated. In a first-stage probit model 
where CATHOLIC SCHOOL is the dependent variable, the coefficient on 
% CATHOLIC IN COUNTY is .001 with a standard error of 3.1 × 10⁻⁴. 
To put this result into perspective, moving a student from the twenty-fifth 
percentile of % CATHOLIC IN COUNTY to the seventy-fifth percentile 
increases the probability that the student will attend a Catholic school by ten 
percentage points. 

23. Implicitly, we have treated % CATHOLIC IN COUNTY as an ex- 
ogenous variable. It will be correlated with the error term in the outcome 
equations if, for example, families that care a great deal about education 
move to counties where many Catholics live in order to take advantage of the 
availability of Catholic schools or lower tuition as a member of the parish. 
This argument could explain the problems we have found when we try to use 
this variable as an instrument. 

24. There is substantial disagreement over this issue in the literature. 
See, for example, Bryk, Lee, and Holland [1993] and Chubb and Moe [1990] 
for two very different views. 
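
The arithmetic behind note 13 can be reproduced as a short check. This is a sketch only: it takes the two marginal effects reported in that note and treats them as constant over a fifteen-point increase in the test score, which is a linear approximation to the probit response. The variable names are hypothetical.

```python
# Marginal effects of the 10TH GRADE TEST SCORE reported in note 13
# (change in probability per additional test-score point).
marginal_effects = {
    "high school completion": 0.004,
    "college entrance": 0.013,
}

SCORE_INCREASE = 15  # roughly one standard deviation, as stated in note 13

# Linear approximation: probability change = marginal effect * score change.
implied_change = {
    outcome: effect * SCORE_INCREASE for outcome, effect in marginal_effects.items()
}

for outcome, change in implied_change.items():
    print(f"{outcome}: {100 * change:.1f} percentage points")
```

The products are 0.06 and 0.195, which match the six and (after rounding) twenty percentage points quoted in the note.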
Abstract 


The interplay between education and fertility has a significant influence on 
the roles women occupy, when in their life cycle they occupy these roles, 
and the length of time spent in these roles. The overall inverse relationship 
between education and fertility is well known; but little is known about the 
theoretical and empirical basis of this relationship. This paper explores 
the theoretical linkages between education and fertility and then examines 
the relationships between the two at three stages in the life cycle. It is 
found that the reciprocal relationship between education and age at first birth 
is dominated by the effect from education to age at first birth with only a triv- 
ial effect in the other direction. Once the process of childbearing has begun, 
education has essentially no direct effect on fertility; but it has a large indi- 
rect effect through age at first birth. 


No factor has a greater impact on the roles women occupy than maternity. 
Whether a woman becomes a mother¹, the age at which she does so, and the 
timing and number of her subsequent births set the conditions under which 
other roles are assumed. Some may deplore this situation and it may be 
changing, but the dominance of motherhood continues to be a fact for the vast 
majority of women. While there is clearly variance in this role dominance, 
the assumption of nonfamilial roles varies markedly with the fact, timing, 
and extent of maternity. 

Education is another prime factor conditioning female roles. Educa- 
tion is expected to impart values, aspirations, and skills which encourage and 
facilitate nonfamilial roles. It is possible that better educated women may 
assume less traditional role patterns than less-educated women with identical 
fertility histories. However, it is also likely that education affects women’s 
roles through differing patterns of fertility. This paper discusses some of 
the possible linkages between education and fertility and reports analyses 
bearing on: (1) the relationship between education and age at first birth, 
(2) the effects of education on the timing of subsequent births, particularly 
on the experience of short birth intervals, and (3) educational differences in 
wanted family sizes. 


Education—Fertility Linkage 


Given the importance of the interplay between education and fertility for the 
roles women occupy in industrialized societies, there has been surprisingly 
little attention paid to the causal linkages between the two.² In part, this may 
be because the possible causal connection between fertility and education 
is exceedingly complex. Some have assumed that education affects fertility 
(e.g., Westoff and Ryder, 1977; Rindfuss and Sweet, 1977; Cho et al., 1970; 
Whelpton et al., 1966), and some have argued that fertility also affects 
education (Waite and Moore, 1978). 

Most of the theory and research concerned with education and fertility 
conceptualizes both in terms of their end products: completed education 
and children ever born. In fact, children come one at a time (usually), and ed- 
ucation is completed a year at a time, sometimes a course at a time. Children 
can come close together, or at intervals of 10, 15 or even 20 years. Formal 
schooling can be completed without interruption; or it can be completed after 
short or long interruptions (Davis and Bumpass, 1976). Models of education 
and fertility should reflect the fact that education and fertility are processes 
which take time to complete and which can intersect each other in complex 
ways. 

The overall relationship between education and fertility has its roots at 
some unspecified point in adolescence, or perhaps even earlier. At this point 
aspirations for educational attainment as a goal in itself and for adult roles 
that have implications for educational attainment first emerge. The desire for 
education as a measure of status and ability in academic work may encourage 
women to select occupational goals that require a high level of educational 
attainment. Conversely, particular occupational or role aspirations may set 
standards of education that must be achieved. The obverse is true for those 
with either low educational or occupational goals. Also, occupational and 
educational aspirations are affected by a number of prior factors, such as 
mother’s education, father’s education, family income, intellectual ability, 
prior educational experiences, race, and number of siblings (for example, see 
Hout and Morgan, 1975). 

Occupational and educational aspirations are also reciprocally related to 
evolving fertility preferences. These fertility preferences include both num- 
ber and timing preferences, that is, whether a first birth is wanted ever and, 
if so, when. The number and timing preferences may be related if, for ex- 
ample, a desire for many children leads to a desire to begin childbearing as 
soon as possible (Bumpass and Westoff, 1970). Moreover, the preference 
for postponing a first birth may lead to interests in other areas which may 
then lead to a decision not to have any children. There is evidence that re- 
peated postponement of the first birth is a typical pattern among those who 
are voluntarily childless (Veevers, 1973). Such preferences for timing are 
necessarily vague, but nonetheless important. Some young women may wish 
to have a baby as soon as possible, perhaps to establish an adult identity sep- 
arate from their parents, or to fulfill strong nurturing needs. Such aspirations 
among young women are likely to have a negative effect on evolving role 
and educational aspirations. Similarly, a young woman who is sure she does 
not want to have a child any time soon, if at all, may expand her role and 
educational aspirations accordingly. Influences in the opposite direction 
operate as the threat of early fertility to educational attainment is recognized 
and fertility desires are adjusted accordingly. 

Both of these preference sets (occupational and educational aspirations 
as well as fertility preferences) influence actual age and education at first 
birth through a set of intervening variables³ that include the standard 
intermediate variables affecting exposure to intercourse, conception risk and 
gestation and parturition (Davis and Blake, 1956; Bongaarts, 1978). 

Adolescents with higher educational and occupational goals may choose 
social patterns that are less likely to lead to early marriage, that is, “not 
wanting to go steady or get serious with boys,” because they want to go to 
college. They may be less willing to engage in intercourse because of the threat 
of possible pregnancy to their educational or career plans. Sexually active 
adolescents with high educational aspirations may be more likely to try to 
control the risk of pregnancy through careful contraceptive use. 

Adolescent women who desire early motherhood (and presumably early 
marriage) are likely to follow social patterns that lead to early intensive emo- 
tional involvement; and, when sexually active, this group may have rela- 
tively low motivation to avoid pregnancy. Such patterns may lead indirectly 
to lower educational achievement because of an early age at first birth. 

Early marriage may have a direct effect on reducing educational 
attainment,⁴ for example, when a girl leaves school in order to be married. These 
social patterns also have an indirect effect on education through factors af- 
fecting pregnancy and early age at first birth. 

It should be noted that in the reciprocal relationship between education 
and age at first birth, the effects of education on age at first birth can only 
be the result of the intermediate variables discussed above (also, see Davis 
and Blake, 1956; Bongaarts, 1978) whereas the effect of age at first birth on 
education may also include a direct effect. 

Both age and education at first birth can affect subsequent role and ed- 
ucational aspirations, and subsequent preferences for the timing and number 
of children. These subsequent aspirations and preferences are also recipro- 
cally related. After the birth of their first child some women may find that 
they wish to reduce their fertility goals, increase their occupational goals, and 
return to school. Others who had planned on continuing their education may 
decide to have more children, or to quickly become pregnant again, either 
because of great satisfaction in the mother role, or because of a sense that it 
is an all-consuming role that precludes other options, or because they are not 
sure of what else to do. 

Education, age at first birth, the possibly revised occupational and ed- 
ucational aspirations, as well as timing and number preferences all affect 
various aspects of the intermediate variables in a process similar to that 
elaborated above with respect to the period before the first birth. The period prior 
to the first birth includes an unmarried and sexually inactive period as well 
as a married interval for most women. For most (but not all) women, the 
period following the first birth begins within marriage. Some women will 
not yet be married and others will have married and separated or divorced 
by the time of the first birth. At the second birth, a woman may be never 
married, currently married, widowed, divorced or separated (Rindfuss and 
Bumpass, 1977). Marital instability is an important social factor in the social 
patterns category in each segment to the extent that it affects other interme- 
diate variables such as frequency of intercourse, periods of abstinence, and 
use of contraceptives. 

Fecundity is largely exogenous to the processes we are examining, 
though it has a clear effect on the timing of the first birth and may medi- 
ate the effect of age at first birth on subsequent fertility. 

While these potential intersections in the relationship between education 
and fertility warrant more intensive study, that is not our purpose in this paper. 
The point we are attempting to make in the preceding discussion is that the 
observed relationship between completed education and completed family 
size is the cumulative outcome of a complex process that involves attitudes 
and decisions about both education and fertility that may change as time 
passes or as the woman moves from one stage to the next, and that it is 
necessary to examine empirically the various stages in the process. 


Data 


The data used are from the 1970 National Fertility Study (NFS), a multi- 
purpose study based on a national probability sample of 6,752 ever mar- 
ried women under 45 years of age residing in the continental United States 
(Westoff and Ryder, 1977). Complete birth and pregnancy histories were 
obtained, thus permitting analysis of age at first birth and of birth intervals. 
Unfortunately, a complete educational history was not obtained. Only edu- 
cation at interview and education at marriage were obtained. This means that 
we have to use education at marriage as a proxy for education at first birth. 
For most women this is a reasonable proxy, since the correlation between 
age at first birth and age at first marriage is 0.74. In order to check the 
reasonableness of using education at marriage, we reran all the analyses 
using education at interview, and results were unaffected. However, it should 
be recognized that for younger mothers the first birth is likely to precede 
the first marriage. Finally, it should be noted that there were no questions 
asked about educational or occupational aspirations during the adolescent 
and young adult years. 
Although not reported in detail here, wherever possible we have also 
examined data from the 1973 National Survey of Family Growth (FGS) (NCHS, 
1978) (a national probability sample of 9,797 women under age 45 who had 
ever been married or who were never married mothers in 1973), and essen- 
tially comparable results were found in both data sets. 


Education and Age at First Birth 


In the absence of accurate data on the intermediate variables, the relationship 
between the fertility and educational processes can be conceptualized as a 
simple causal process. The aspirations, plans, and decisions (and “apparent” 
nondecisions) leading to an early first birth may result in lowered educational 
aspirations and achievement. Women who desire and obtain a high level of 
education may adjust their fertility preferences accordingly. Both the educa- 
tional and the first birth process are affected by a set of exogenous factors re- 
flecting background characteristics and characteristics of early adolescence. 
A model of these relationships is shown in Figure 1. The rationale for this set 
of exogenous variables, and their effects on education and age at first birth, 
is considered elsewhere (Rindfuss and St. John, 1979); in the present paper 
we concentrate only on the relationship between education and age at first 
birth. Table 1 indicates the measurement of these exogenous variables, and 
the Appendix reports the zero-order correlations among all the variables in 
Figure 1. 

That the relationship between education and age at first birth should be 
viewed as potentially reciprocal is often overlooked: one direction of causa- 
tion is usually emphasized to the exclusion of the other. For example, Jaffe 
(1977:22) asserts: “Pregnancy is the most common cause of school dropout 
among adolescent girls in the U.S.” Others, however, contend that educa- 
tion determines age at first birth; and, further, that women who get pregnant 
while still in school do so to have an “acceptable” reason for dropping out of 
school (Cutright, 1973). Since there is considerable overlap in the time when 
women leave school and the time when they have their first child (median 
age at first birth is currently about 22, and, of recent cohorts, 25% have their 
first birth by the end of the 19th year), it is important to investigate the extent 
to which the educational attainment process and age at first birth process are 
reciprocally related. 

The part of the model shown in Figure 1 of direct interest here is the re- 
lationship between education and age at first birth. We allow for a reciprocal 
relationship between these two variables, with each affected by other vari- 
ables in the model as well. Age at first birth is computed from the date of 
respondent’s birth and date of birth of respondent’s first child. Education, 
as noted, is education at marriage, not education at first birth.

Figure 1. A Model of the Relationship between Educational Attainment and the
Beginning of Motherhood.

In order to estimate the reciprocal relationships between education and age
at first birth,
instrumental variables are needed for each of the two endogenous variables— 
that is, variables are needed which directly affect one of the endogenous vari- 
ables but not the other, which are not causally determined by the endogenous 
variables, and which are not correlated with the unspecified source of the en- 
dogenous variable for which it is not an instrument (Duncan, 1975; Heise, 
1975). As can be seen from Figure 1, fecundity is used as the instrument for 
age at first birth and respondent’s father’s occupation as the instrument for
education. Fecundity is measured by whether or not the respondent had a
miscarriage prior to her first birth. A miscarriage before the first birth postpones
the first birth in a direct and obvious way: it takes time to conceive again 
and carry that conception to successful parturition. It also gives the woman 
a second chance if she wants to contracept. The additional time involved 
as the result of a miscarriage before the first birth can be substantial since
approximately one-fourth of the women who have one miscarriage before their
first birth have two or more miscarriages before their first birth. 

A miscarriage before the first birth should have no effect on education, 
except indirectly through age at first birth. A direct effect would occur only
if the woman dropped out or was expelled from school prior to the miscarriage
because of the pregnancy. If this were the case, then the miscarriage would
be correlated with the disturbances in the education equation and would be 
unsuitable as an instrument. However, this is unlikely because the vast ma- 
jority of miscarriages occur in the early months of a pregnancy, before it is 
obvious to observers that the woman is pregnant, and often before the woman 
knows that she is pregnant (see National Center for Health Statistics, 1966). 
If the woman is unmarried, she is unlikely to notify the school that she is 
pregnant until it becomes absolutely necessary. It is probably in part for this 
reason that unmarried women often do not seek prenatal care until very late in 
pregnancy (National Academy of Sciences, 1973). Furthermore, education 
should not have any effect on whether or not there is a miscarriage before 
the first birth. The only exception to this statement would involve a woman 
obtaining an induced abortion in order to complete her education. However, 
induced abortions are so grossly underreported in United States fertility sur- 
veys that reported miscarriages are essentially spontaneous miscarriages. 

That respondent’s father’s occupation affects respondent’s educational 
attainment is well known (Alexander and Eckland, 1974; Blau and Duncan, 
1967; Kerckhoff and Campbell, 1977; Sewell and Hauser, 1977) and does 
not require further elaboration here. We also argue that father’s occupation 
does not have a direct relationship with age at first birth. Rather, we would 
argue that the relationship is indirect through education. It can be expected 
that families of orientation in which the father has a high status job would
be more likely to encourage daughters to postpone the first birth than families
of orientation in which the father has a low status job. However, the
most likely explicit and implicit justification for this encouragement would
be to allow daughters time to complete their education, and thus the effect
on age at first birth would be indirect. Still, there may also be an
intergenerational transmission of norms regarding age at first birth. (Leonetti
[1978] provides a good example of this in the case of Japanese-Americans.)
To the extent that socioeconomic status directly affects the intergenerational 
transmission of norms regarding age at first birth—that is, in addition to the 
indirect transmission through educational aspirations—then respondent’s fa- 
ther’s occupation would not be a suitable instrument for education. Recent 
work by Thornton (forthcoming) suggests that there is no direct transmission 
of fertility norms from parental status. Instead, this influence was transmitted
through the education of the offspring.

In order to examine our assumption that parental socioeconomic status 
does not have a direct effect on the transmission of norms regarding age at 
first birth, we examined the determinants of ideal age at first birth. The 1970 
NFS included the following question: 

“Q. 3: What do you think is the ideal age for a woman to have her first 
child?” 

Although this question suffers from all the problems of “ideal” ques- 
tions (Blake, 1966; Bumpass and Westoff, 1970; Rindfuss, 1974; Ryder and 
Westoff, 1969) as well as some problems specific to this question (Rindfuss 
and Bumpass, 1978), it does provide the best measure available for norms 
regarding age at first birth. Using a sample of recently married women in 
order to minimize the possibility that the responses to the question would 
be affected by the cumulative maternal experience of the woman, we find 
that, after other appropriate factors are controlled, father’s occupation has 
no significant direct effect on the ideal age to have a first birth. This fur- 
ther supports Thornton’s results and supports the theoretical argument that 
parental socioeconomic status influences age at first birth only indirectly 
through its effect on the offspring’s educational aspirations, and thus sup- 
ports the use of father’s occupation as an instrument for education in our 
model. 

However, somewhat less consistent support was found in an examina- 
tion of the 1971 National Survey of Young Women data. Since father’s oc- 
cupation was not available, the relationship between father’s education and 
ideal age at first birth was considered for this sample of teenagers 15-19 
years of age. While most of the association is accounted for by educational 
aspirations, ideal age at first birth is 0.4 years lower among the children of 
high school graduates than among those of fathers who attended college, net 
of other factors. While this modest net effect of father’s education calls for
caution about our theoretical position, we would expect the net effect of father’s occupation
on ideal age at first birth to be considerably weaker. 

Before presenting the results, it is necessary to discuss some of the vari- 
ables which are not included in Figure 1 and the possible biases their exclu- 
sion might introduce. The first is marriage. Although we recognized the role 
of age at marriage in the earlier discussion in this paper (especially since it is 
incorporated in sexual experience), it is age at first birth that is emphasized 
both there and in our analysis here. Clearly, age at first marriage and age 
at first birth are closely related, normatively and empirically. However, we 
feel that the first birth has greater consequences for the life style and roles of 
the woman (Rindfuss, 1979), and that the effects of the first birth are more 
permanent than those of first marriage. Marini (1978) has recently argued
that age at marriage is more important than age at first birth in the transition 
to adulthood because age at marriage “usually sets a lower limit on the age at 
which first birth occurs.” We disagree for the following reasons: In the first 
place, motherhood frequently precedes first marriage. (And this is more 
likely to be the case the younger the age at first birth.) Second, some people 
may initiate the serious consideration of marriage on the basis of when they 
want to begin parenthood, as reflected in the phrase “time to settle down and 
start a family.” The high incidence of premarital intercourse argues against 
the notion that age at first marriage sets a lower bound on exposure to the risk 
of conception. Third, “becoming a parent” is the modal response of married 
parents to the question of what marks the transition to adulthood (Hoffman, 
1978). Fourth, parenthood is more permanent than marriage, particularly 
for women since children tend to stay with the mother following a marital 
disruption. Preston (1975) has estimated that almost half of the current mar- 
riages will end in divorce; thus, women often move in and out of the wife 
role. Finally, and perhaps most importantly, motherhood roles more severely 
constrain other life options of a woman than do marital roles, especially dur- 
ing the early childbearing years. For these reasons, our emphasis is on age at 
first birth. Given the high correlation between age at first birth and age at first 
marriage, and given that both are affected by similar exogenous variables, we 
have not included both in the analysis. Furthermore, given the assumptions 
of the model, the exclusion of age at first marriage will not bias our estimates 
of the relative importance of the processes leading to educational attainment 
and to the first birth. 

In order to allow women sufficient time to get married (and, thus, be 
eligible to be in the sample) and to have a first birth, the analysis of the 
education-age at first birth relationship will be limited to women aged 35-44.
Most of those who will ever marry before the end of the reproductive period 
are married by age 35. For example, the proportion of women ever married 
increases from 0.873 at ages 25-29 to 0.926 at ages 30-34 to 0.941 at ages 
35-39. But the proportion of women ever married increases only slightly to 
0.946 at ages 40-44 (U.S. Bureau of the Census, 1972). The same holds true 
for first births. Most of those who will ever give birth do so by age 35. For 
example, 79.2% of the birth cohort of 1930-1934 had a live birth by ages 25- 
29, 87.7% did so by ages 30-34, and 90.2% had a live birth by ages 35-39. 
This percentage increased only slightly to 90.8% by ages 40-44. Less than 
3% of the women in this birth cohort who had a live birth had it after age 35 
(Heuser, 1976). 

Childless women are excluded from this analysis of age at first birth.
Only a small proportion (less than 10%) of the married women in these 
cohorts remained childless (Heuser, 1976). To the extent that postponement
leads to voluntary childlessness (Veevers, 1973), this exclusion could
lead to a weaker estimated effect of age at first birth than actually exists. 
However, childlessness in these cohorts was primarily a product of fecundity 
impairments. 

The model shown in Figure 1 includes background characteristics, as- 
pects of early adolescence, and the reciprocal relationship between education 
and age at first birth. Period factors are not included, and this needs to be kept 
in mind when interpreting our results. The respondents in this analysis were 
aged 35-44 in 1970. Taking 15 as the youngest age at first birth and 35 as
the oldest means that these women were having their first births from 1941 
to 1970. During this long period, there were a number of events affecting the 
timing of fertility, including World War II, the Korean War, and the Vietnam 
War. Those women who postponed their first birth were, of course, exposed 
to more of these period factors, which could affect the timing of their first 
birth. Since so little is known about the nature of period factors that af- 
fect the timing of fertility (Rindfuss et al., 1978), they cannot be explicitly 
included in the analysis. Furthermore, the younger women in our sample 
experienced the period factors at different ages than the older women in the 
sample. To see if this would affect our results, we ran the model separately 
for women aged 35-39 and 40-44. The results were virtually identical for 
the two groups. 
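
The 1941 to 1970 window quoted above follows from the age range of the sample. A minimal sketch of the arithmetic, assuming ages are counted in completed years at the 1970 interview (the variable names are illustrative, not from the original analysis):

```python
INTERVIEW_YEAR = 1970
ages_at_interview = (35, 44)   # sample restriction stated in the text
first_birth_ages = (15, 35)    # youngest and oldest ages at first birth taken in the text

# The oldest respondents were born earliest, so they set the start of the window.
earliest_birth_year = INTERVIEW_YEAR - max(ages_at_interview)
latest_birth_year = INTERVIEW_YEAR - min(ages_at_interview)

earliest_first_birth = earliest_birth_year + min(first_birth_ages)
latest_first_birth = latest_birth_year + max(first_birth_ages)
print(earliest_first_birth, latest_first_birth)
```

The window is wide because it combines a ten-year spread in birth years with a twenty-year spread in possible ages at first birth.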

The work of Easterlin (1962, 1966, 1973) and others suggests that
the financial status of the respondent’s family of orientation while the re- 
spondent was an adolescent will affect the age at which she has her first 
child. Unfortunately, we do not have a direct measure of the respondent’s 
parents’ financial status while the respondent was an adolescent. However, a 
number of background variables in the model, such as race, number of sib- 
lings, farm background, regional background, and family composition when 
respondent was 14, indirectly control for the respondent’s family’s financial 
situation. 

Further, the model shown in Figure 1 also does not include the labor 
force experiences of women. As noted earlier, labor force experiences and 
aspirations are likely to affect, and be affected by, childbearing and child- 
bearing preferences. In fact, there is a long literature on this relationship 
(see Waite and Stolzenberg, 1976; and Smith-Lovin and Tickamyer, 1978, 
for recent summaries of this literature). Unfortunately, adequate labor force 
participation information is not available. 

Estimation of the effects shown in Figure 1 was accomplished by using
two-stage least squares regression analysis (Goldberger, 1964; Johnston,
1972). The estimates were made using ordinary least squares in two steps,
making the appropriate corrections as outlined by Hout (1977). The results
are shown in Table 2.


Table 2. Metric and Standardized Coefficients Measuring the Reciprocal
Relationship between Education and Age at First Birth, 1970 NFSᵃ.

Independent          Dependent            Metric        Standardized
Variable             Variable             Coefficient   Coefficient

Education            Age at First Birth   0.741*        0.429*
Age at First Birth   Education            0.075         0.130

Correlation of Disturbances (U and V): -0.255

ᵃ N = 1,766.
* Significant at 0.05.

This table shows only the results for the endogenous variables; results 
for the complete model are reported and discussed elsewhere (Rindfuss and 
St. John, 1979). 
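
The two-step procedure can be illustrated on simulated data. The following is only a sketch under stated assumptions: the structural coefficients, the disturbance structure, and the helper names `ols` and `fitted` are hypothetical; the real model has many more exogenous variables; and no standard errors are computed (the second-step OLS standard errors would need the kind of correction the text attributes to Hout).

```python
import random

def ols(X, y):
    """Least-squares coefficients via the normal equations (Gauss-Jordan elimination)."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

def fitted(X, coef):
    """Fitted values X * coef."""
    return [sum(x * b for x, b in zip(row, coef)) for row in X]

# Hypothetical structural effects, chosen only for illustration.
B_ED_TO_AGE, B_AGE_TO_ED = 0.75, 0.10

rng = random.Random(0)
ed, age, fecund, dadsocc = [], [], [], []
for _ in range(20000):
    f = 1.0 if rng.random() < 0.2 else 0.0   # miscarriage before first birth
    d = rng.gauss(0, 1)                      # father's occupation, standardized
    w = rng.gauss(0, 1)                      # shared factor: correlated disturbances
    u = rng.gauss(0, 2) - w                  # disturbance in the age equation
    v = rng.gauss(0, 1.5) + w                # disturbance in the education equation
    # Structural equations, solved jointly for age and then education:
    #   age = 10 + B_ED_TO_AGE * ed  + 1.5 * f + u
    #   ed  = 10 + B_AGE_TO_ED * age + 1.0 * d + v
    a = (10 + B_ED_TO_AGE * (10 + d + v) + 1.5 * f + u) / (1 - B_ED_TO_AGE * B_AGE_TO_ED)
    e = 10 + B_AGE_TO_ED * a + d + v
    ed.append(e); age.append(a); fecund.append(f); dadsocc.append(d)

# First step: regress each endogenous variable on ALL exogenous variables.
Z = [[1.0, f, d] for f, d in zip(fecund, dadsocc)]
ed_hat = fitted(Z, ols(Z, ed))
age_hat = fitted(Z, ols(Z, age))

# Second step, age equation: father's occupation is excluded (instrument for education).
effect_ed_on_age = ols([[1.0, eh, f] for eh, f in zip(ed_hat, fecund)], age)[1]
# Second step, education equation: fecundity is excluded (instrument for age at first birth).
effect_age_on_ed = ols([[1.0, ah, d] for ah, d in zip(age_hat, dadsocc)], ed)[1]

print(round(effect_ed_on_age, 2), round(effect_age_on_ed, 2))
```

In this simulation each instrument really is excluded from the other equation and uncorrelated with its disturbance, so the two-step estimates should land near the hypothetical values. In the application, those exclusions are exactly the assumptions the preceding paragraphs argue for.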

The effect of education on age at first birth is significant—both statisti- 
cally and substantively. Each additional year of schooling results in the delay 
of the first birth by approximately three-quarters of a year. However, the ef- 
fect of age at first birth on education is not statistically significant; and even 
if it were, the effect would be trivial substantively. 
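
The metric and standardized columns of Table 2 can be checked against each other. The sketch below assumes that each standardized coefficient is the metric coefficient multiplied by the ratio of the standard deviation of the independent variable to that of the dependent variable; the paper does not state this, so treat it as an assumption. If it holds, both rows should imply the same ratio of standard deviations.

```python
# Coefficients as printed in Table 2.
metric_ed_to_age, std_ed_to_age = 0.741, 0.429
metric_age_to_ed, std_age_to_ed = 0.075, 0.130

# std = metric * sd(independent) / sd(dependent), so:
#   row 1 implies sd(ED)/sd(AGEFST) = std / metric
#   row 2 implies sd(AGEFST)/sd(ED) = std / metric, i.e. sd(ED)/sd(AGEFST) = metric / std
sd_ratio_from_row1 = std_ed_to_age / metric_ed_to_age
sd_ratio_from_row2 = metric_age_to_ed / std_age_to_ed

print(round(sd_ratio_from_row1, 3), round(sd_ratio_from_row2, 3))
```

The two implied ratios agree to within rounding of the published figures, which suggests the table is internally consistent under this reading.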

The results shown in Table 2 are based on the assumption of linear ef- 
fects. It might be argued that the effect of age at first birth on education is 
not linear. The inclination to have a birth at a very young age may have more 
serious effects on educational plans than the preference to have a child at a 
later age. The potential conflict between school and motherhood is greatest 
at the younger ages at first birth. This suggests that a nonlinear age at first 
birth effect on education should be specified. Such a specification should 
force a difference of a year at the younger ages at first birth to be larger than 
a difference of a year at the older ages at first birth. We used three differ- 
ent transformations of age at first birth (AGEFST) to explore this possibility: 
(1) LN (AGEFST), (2) 1/AGEFST, and (3) 1/(AGEFST)². The model shown
in Figure 1 was reestimated for each of these three transformations. In each 
case the results are the same as the linear model: age at first birth does not 
have a significant effect on educational attainment. 
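
To see why these transformations give more weight to differences at young ages, compare a one-year difference at age 16 with a one-year difference at age 30 under each of them. This is a small illustrative sketch; the two comparison ages are arbitrary choices, not values from the paper.

```python
import math

transforms = {
    "AGEFST": lambda a: a,
    "LN(AGEFST)": math.log,
    "1/AGEFST": lambda a: 1 / a,
    "1/AGEFST^2": lambda a: 1 / a ** 2,
}

def one_year_gap(f, age):
    """Absolute change in the transformed variable between age and age + 1."""
    return abs(f(age + 1) - f(age))

# Ratio of the gap at a young age to the gap at an older age; 1.0 means "treated equally".
ratio = {name: one_year_gap(f, 16) / one_year_gap(f, 30) for name, f in transforms.items()}
for name, r in ratio.items():
    print(f"{name:<12} {r:.2f}")
```

The linear specification treats the two differences identically, while each transformation stretches the difference at the younger age, most strongly for the inverse square.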

Furthermore, there is some evidence to suggest that the family building 
process may be different for whites and blacks. For example, blacks have 
higher illegitimacy rates than whites (NCHS, 1977), and blacks appear to 
rely more heavily on relatives to take temporary, but primary, care of children 
born to young mothers than whites do (Rindfuss, 1977). In order to check for
a potential interaction with race, we reran the analysis separately for whites
and blacks. The important point for the present analysis is that, for both
blacks and whites, education has a strong and significant effect on age at first 
birth, but age at first birth has an insignificant effect on education. Thus, our 
results are unaffected by any racial interaction. 

In the relationship between education and age at first birth, the prin- 
cipal direction of causality is from education to age at first birth. Those 
who have recently examined the relationship between education and age at 
first marriage have found corroborating results (Marini, 1978; Alexander and 
Eckland, 1978), namely, that education has a much stronger effect on age 
at first marriage than age at first marriage has on education. Given the sheer 
amount of time the mother role requires in contrast to the wife role, the timing 
of the first birth has greater consequences for the roles women occupy. Yet, it 
is interesting to note that (ignoring the differences between the samples used 
here and those used by Marini [1978] and Alexander and Eckland [1978]), 
age at first marriage appears to have a somewhat greater effect on education 
than age at first birth. Even though age at first birth has a greater effect on the 
roles occupied by women, age at first marriage could have a stronger effect 
on educational attainment because first marriage schedules are younger and 
more compact than first birth schedules. Thus, more marriages take place 
during the years in which women are in school. 

The finding that age at first birth has only a very small effect on educa- 
tional attainment may seem paradoxical, given the social policy concern with 
the pregnant girls who have to drop out of school and face reduced social op- 
portunities as a consequence. Such a fate is unquestionably experienced by 
some women, particularly those among the 3% to 6% of the American co- 
hort that have had a first birth before age 17. But the fact is that the vast 
majority of women do not get pregnant while they are enrolled in school. 
Even among those who do become mothers at ages at which society expects 
one to be in school, the direction of causality might run from education to 
fertility. Zelnik and Kantner (1978) and Ross (1978) suggest that a signifi- 
cant minority of premarital pregnancies were intentional. To further explore 
this issue, we compared the age at leaving school with age at first birth for
women who became mothers at age 17 or younger. If leaving school and
the first birth occur in the same year, it is ambiguous which process domi- 
nates. But for those who left school more than a year before their first birth, 
one can assume that the educational process is affecting the fertility process. 
Surprisingly, more than 40% of the women who had a first birth at age 17 or 
less dropped out of school at least a year prior to becoming a mother—which 
suggests that even at the very young ages at motherhood, the fertility process 
is being affected by the educational process. Further, there is longitudinal
evidence showing a negative relation between educational aspirations and 
age at first birth (Marshall and Cosby, 1977; Card and Wise, 1978, Table 3), 
which suggests that many of those who have a first birth while they are of 
school age do so after deciding not to continue in school—and, perhaps, do 
so to justify dropping out of school. Finally, Haggstrom and Morrison (1979) 
find that among teenagers who do not drop out of high school, the effects of 
adolescent parenthood on subsequent educational aspirations are extremely 
small when other appropriate factors are controlled. All of this does not mean 
that fertility never truncates education, but only that it does so rarely. In the 
vast majority of the cases, education and educational aspirations determine 
age at first birth. 

It is important that scientific discourse clarify the difference between a 
social policy concern that requires amelioration and the characterization of 
the overall process in which that concern is embedded. 


Education and the Lengths of Birth Intervals 


As discussed in the first section of this paper, we would expect to find a 
variety of reasons why women with more education would want to avoid 
very short birth intervals and we would expect them to be more effective at 
implementing their preferences. In this section we examine the relationship 
between education and the probability of having a short interbirth interval. 
Unlike in the previous section, here we assume that the direction of causality
runs from education to the length of birth intervals.

The birth history information contained in the 1970 NFS allows us to 
compute the length of each birth interval. Given the well-known difficulties 
involved in the analysis of birth intervals (see Bumpass et al., 1977, for a 
fuller discussion), we initially constructed life tables for each birth interval. 
These preliminary life tables were constructed for intervals begun in the pe- 
riod 1959-1968. By restricting the analysis to intervals begun in this period, 
we avoid a young-age-at-initiation bias (see Rindfuss and Bumpass, 1979). 
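
A life table handles intervals that are still open at interview, which a simple proportion cannot. The sketch below uses a product-limit (Kaplan-Meier) calculation as a stand-in; the paper does not say which life table method was used, and the function name and the data are hypothetical.

```python
def proportion_closed_by(durations, closed, horizon):
    """Product-limit estimate of the probability that the next birth occurs
    within `horizon` months of the previous birth.

    durations: months from the previous birth to the next birth or to interview
    closed:    True if the interval ended in a birth, False if still open (censored)
    """
    still_open = 1.0
    event_times = sorted({d for d, c in zip(durations, closed) if c and d <= horizon})
    for t in event_times:
        at_risk = sum(1 for d in durations if d >= t)   # intervals observed up to month t
        births = sum(1 for d, c in zip(durations, closed) if c and d == t)
        still_open *= 1 - births / at_risk
    return 1 - still_open

# Hypothetical intervals in months; the last two were still open at interview.
durations = [12, 15, 20, 30, 10]
closed = [True, True, True, False, False]
p18 = proportion_closed_by(durations, closed, horizon=18)
print(round(p18, 3))
```

The interval censored at 10 months drops out of the risk set before the first birth is counted, so it neither inflates nor deflates the estimated proportion closed by 18 months.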

The preliminary life table analyses showed the expected positive rela- 
tionship between education and length of intervals. However, this conclu- 
sion is based on a bivariate analysis, and there are numerous other factors 
affecting the length of birth intervals (e.g., Bumpass et al., 1978), and the 
effects of these factors should be controlled. Unfortunately, the sample size 
of the 1970 NFS (or the 1973 FGS) is far too small to permit the simultane- 
ous control of all these factors by using conventional life table techniques. 
Consequently, we used regression analysis to examine the probability of giv- 
ing birth within a relatively short time interval—specifically, the probability 
of giving birth within 18 months of the previous birth. Because the life table
results suggested that the differences in interbirth interval length are greater
between adjoining categories at the lower educational categories than at the
higher educational categories, we used a variant of multiple regression analysis,
Multiple Classification Analysis (Andrews et al., 1973), to see if this
pattern continued when other factors were controlled. The results are summarized
in Table 3.


Table 3. Differentials in the Proportion Experiencing Birth Intervals of
18 Months or Less, for All Second, Third and Fourth Birth Intervals Begun
1959-1968, by Education, Gross and Netᵃ Percentages: 1970 NFS

Education at   Second Birth Interval    Third Birth Interval     Fourth Birth Interval
Marriage            N   Gross    Net         N   Gross    Net         N   Gross    Net

Total            2612      25             2236      19             1551      17
1-8               155      33     31       168      32     29       164      29     25
9-11              657      28     26       592      21     18       433      20     17
12               1218      24     23      1016      16     17       670      15     16
13-15             388      22     24       312      17     20       202      11     14
16+               194      20     29       149      16     21        82      13     18

ᵃ Adjusted through a dummy variable regression analysis for the effects of race, religion, region,
age at first birth, marital status at first birth, contraceptive use before first birth, planning status
of first birth and smoking before age 16.
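
The gross and net percentages can be illustrated with a dummy variable regression on simulated data. This is a sketch under stated assumptions, not the Andrews et al. program: there is a single control (an indicator of an early first birth), the probabilities are hypothetical, and the "net" figure is computed as the average predicted proportion if every woman were assigned a given education level while keeping her own value of the control.

```python
import random

def ols(X, y):
    """Least-squares coefficients via the normal equations (Gauss-Jordan elimination)."""
    k = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

LEVELS = ["1-8", "9-11", "12", "13-15", "16+"]
P_EARLY = [0.70, 0.55, 0.40, 0.25, 0.10]  # hypothetical P(early first birth) by education

rng = random.Random(1)
edu, early, short = [], [], []
for _ in range(10000):
    j = rng.randrange(len(LEVELS))
    e = 1.0 if rng.random() < P_EARLY[j] else 0.0
    # By construction, education affects the outcome only through the control.
    s = 1.0 if rng.random() < (0.35 if e else 0.15) else 0.0
    edu.append(j); early.append(e); short.append(s)

def design_row(j, e):
    """Intercept, dummies for education levels 1..4 (level 0 is the reference), control."""
    return [1.0] + [1.0 if j == m else 0.0 for m in range(1, len(LEVELS))] + [e]

coef = ols([design_row(j, e) for j, e in zip(edu, early)], short)

n = len(short)
gross, net = [], []
for m in range(len(LEVELS)):
    members = [s for j, s in zip(edu, short) if j == m]
    gross.append(sum(members) / len(members))
    preds = [sum(x * b for x, b in zip(design_row(m, e), coef)) for e in early]
    net.append(sum(preds) / n)

for name, g, a in zip(LEVELS, gross, net):
    print(f"{name:>5}  gross {100 * g:3.0f}  net {100 * a:3.0f}")
```

Because the simulated outcome depends on education only through the control, the gross percentages fall with education while the net percentages are nearly flat, the same qualitative contrast the Gross and Net columns are designed to reveal.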

Controlling for other factors that affect the length of interbirth intervals 
eliminates much of the relationship between education and the probability 
of having a short birth interval. Compare the gross and net columns for the 
second, third and fourth birth intervals.¹⁰ The difference which remains after
controlling for other variables is primarily between those with a grade school 
education and all others. Given that those with only a grade school education 
are a small proportion of the population, and since the proportion with only 
a grade school education is declining, the principal result to emerge from 
Table 3 is that, when the effects of other factors are controlled, the respon- 
dent’s education at first marriage has essentially no effect on the probability 
of having a short second, third or fourth birth interval.¹¹


Education and Fertility Preferences 


As discussed earlier in this paper, educational preferences and fertility pref- 
erences affect each other; and, since neither is fixed, their interrelationship 
develops over time. To examine adequately this complex set of interrela- 
tionships would require longitudinal data of the kind not currently available. 
However, in the absence of the appropriate longitudinal data, it is still pos- 
sible to examine part of the process by looking at the effect of education at 
marriage on fertility preferences at time of interview. Framed this way, the
causal direction is essentially unambiguous. 

Education at marriage can affect fertility preferences in two ways. First, 
education at marriage can have a direct effect on fertility preferences. Insofar 
as increased education makes a larger variety of roles available to women, we 
could expect education to have a direct and negative effect on fertility prefer- 
ences. In addition, specific topics covered while in school might have a direct 
negative effect on fertility preferences. Second, education at marriage can 
have an indirect effect on fertility preferences through its effect on age at first 
birth. As shown earlier, higher levels of educational attainment result in older 
ages at first birth. An older age at first birth, in turn, leads to longer intervals 
between births (Bumpass et al., 1978). Thus, education leads to older ages 
at any given parity; and older ages at any given parity have a negative ef- 
fect on the probability of wanting another child (Rindfuss and Bumpass, 
1978). 

The measure of fertility preferences used here, FERTPREF, is the sum 
of the number of “wanted” children the woman had had by the time of the in- 
terview plus the additional number of children she intended to have. For each 
live birth, the woman was asked a series of questions to determine whether or 
not, before that child was conceived, she wanted to have a birth of that order 
at some time during her reproductive life (see Westoff and Ryder, 1977, for 
a more detailed description). Such a series of questions minimizes the pos- 
sibility of post factum rationalization of unwanted births (Rindfuss, 1974). 
The additional number intended is obtained from a question asking the re- 
spondent how many additional children she intended. This fertility prefer- 
ence measure is coded in numbers of children and has a mean of 2.9, and a 
standard deviation of 1.5.¹²

Because one of our interests is in the mediating effect of age at first birth, 
the sample being analyzed is limited to mothers, that is, women who have 
had at least one live birth. As in the previous two sections, in order to allow 
women sufficient time to get married and have a first birth, younger women 
are excluded from the analysis. The analysis in this section, like the age at 
first birth analysis, will be restricted to respondents aged 35-44 at the time 
of the interview. Because the full set of questions used in constructing our 
fertility preference measure was not asked of postmarried women (i.e., those 
widowed, divorced or separated at the time of the interview), the analysis 
will be limited to currently married women. Finally, for ease of presentation, 
the set of exogenous variables to be used here, in addition to education at 
first marriage, is exactly the same as those shown in Figure 1 and described 
in Table 1. We have experimented with other sets of exogenous variables and 
with other definitions of the sample, and the results are similar in all cases. 

Figure 2. A Modelᵃ of the Relationship between Education at Marriage and
Fertility Preferences (Standardized Coefficients)ᵇ

[Path diagram linking ED, AGEFST, FERTPREF, and the other exogenous variables.ᵃ]

TOTAL EFFECT: ED → FERTPREF = -0.058

ᵃ The other exogenous variables in the model are: DADSOCC, RACE, NOSIB, FARMBACK,
REGNBACK, ADOLFAM, RELIGION, YOUNGCIG, and FECUND. See Table 1 for a
description of the measurement of these variables.

ᵇ N = 1,551.

ᶜ Significant at 0.01.


The results are summarized in Figure 2. In order to focus on the 
education-fertility preference relationship, only the direct and indirect effects 
of education are shown. It can be seen that the direct effect of education on 
fertility preferences is trivial and insignificant. Virtually all of the effect of
education at marriage on fertility preferences operates through age at first
birth. Furthermore, the importance of age at first birth in influencing fertility 
preferences at time of interview should be underscored. Although it is not 
shown in Figure 2, age at first birth has a stronger direct effect on fertility 
preferences measured at time of interview than any of the listed exogenous 
variables. Thus, it appears that education affects fertility preferences by sort- 
ing women into various ages at first birth. 
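
The split of a total effect into a direct part and a part operating through age at first birth can be sketched with three least-squares fits. The data and path values below are simulated and hypothetical, the coefficients are unstandardized, and the other exogenous variables are left out for brevity.

```python
import random

def ols(X, y):
    """Least-squares coefficients via the normal equations (Gauss-Jordan elimination)."""
    k = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

rng = random.Random(2)
ed, agefst, fertpref = [], [], []
for _ in range(2000):
    e = rng.gauss(0, 1)
    a = 0.4 * e + rng.gauss(0, 1)                # hypothetical ED -> AGEFST path
    y = -0.02 * e - 0.15 * a + rng.gauss(0, 1)   # hypothetical direct and AGEFST -> FERTPREF paths
    ed.append(e); agefst.append(a); fertpref.append(y)

# Total effect: FERTPREF on ED alone.
total = ols([[1.0, e] for e in ed], fertpref)[1]
# Path from ED to the mediator.
path_ed_age = ols([[1.0, e] for e in ed], agefst)[1]
# Direct effect of ED and the path from the mediator, estimated jointly.
_, direct, path_age_pref = ols([[1.0, e, a] for e, a in zip(ed, agefst)], fertpref)

indirect = path_ed_age * path_age_pref
print(round(total, 3), round(direct, 3), round(indirect, 3))
```

For least-squares fits on the same sample, the total effect equals the direct effect plus the product of the two paths exactly, which is the accounting behind a statement that an effect "operates through" an intervening variable.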

For approximately four-fifths of these women, education at first mar- 
riage is the same as education at interview; but one-fifth of these women 
have attended school since their first marriage (Davis and Bumpass, 1976).
For many women, this school attendance takes place a considerable time af- 
ter the first marriage. For example, for women first married between 1951 
and 1955 who returned to school after marriage, 62% last attended school 
10 or more years after the first marriage. This additional schooling could 
affect fertility preferences, or could be affected by fertility preferences. We 
do not have the appropriate data to sort out these possibly reciprocal influ- 
ences. But we did rerun the analysis in Figure 2 using education at inter- 
view instead of education at marriage, and the results are suggestive. The 
finding, as before, is that most of the relationship between education and
fertility preferences operates through age at first birth. However, the direct
relationship between education and fertility preferences is somewhat larger 
when education at interview is used than when education at first marriage is
used. Without being able to sort out the potential reciprocal effects, we can 
only speculate that education after marriage operates to provide options that 
would not otherwise be available, or is itself a response to (or simultaneous 
with) a decision to terminate childbearing earlier than planned. This issue
warrants further examination.


Conclusion 


To summarize, the reciprocal relationship between education and age at first 
birth is dominated by the effect of education on age at first birth, with only a
trivial effect in the other direction. 

Once the process of childbearing has begun, education has essentially 
no direct effect on that process. Education has little direct effect on either 
the length of interbirth interval or on fertility preferences. Work by Vaughn 
and her colleagues (1977) shows that education has no direct effect on con- 
traceptive efficacy. However, education has a significant indirect effect on 
these various components of fertility because it is the major determinant of 
age at the beginning of childbearing; in fact, education has a substantially 
greater influence on age at first birth than any other variable (Rindfuss and 
St. John, 1979). Thus, it is the postponing of motherhood that produces the 
oft-observed negative bivariate relationship between education and children 
ever born. 

The powerful mediating effect of age at first birth is of interest in its own 
right. Older ages at first birth lead to longer interbirth intervals (Bumpass 
et al., 1978), more effective contraceptive use (Vaughn et al., 1977), and pre- 
ferences for fewer children (as shown in the previous section of this 
paper). 

These results, particularly if they are supported by future research on 
more recent cohorts, raise a set of interesting policy issues about which we 
can only speculate at present. Because the postponement of something is al- 
ways more amenable to policy initiatives than its prevention, policies aimed 
at influencing age at first birth would be more likely to succeed than policies 
aimed at directly influencing children ever born. Furthermore, how adoles- 
cents spend their time has been accepted (although not universally) as some- 
thing governments can legitimately influence—the military draft system is 
the most obvious example. 

We began with the observation that a major way education might af- 
fect the roles women occupy is through altering the structure of childbearing 
experience, given the dominance of mother roles. We conclude that such 
educational effects as we can identify are explicable more in terms of ed- 
ucation’s effect on age at first motherhood than in terms of other values or 
aspirations that might derive from advanced schooling. 


Notes 


1. Here, and throughout the paper, we use the term “mother” in its so- 
cial rather than biological sense. The biological mother is the female who 
gives birth to the child. The social mother need not be the biological mother; 
but, typically, the two are the same. It is the social mother that has primary 
responsibility for the care and nurture of the child. This role need not be oc- 
cupied by a female, but, typically, it is. Also, the word “children” throughout 
this paper is used in its social, rather than biological, sense. 

2. The work of Holsinger and Kasarda (1976) for developing countries 
is an exception. 

3. In actual practice, we know of no case where all the intermediate 
variables are adequately measured. Models are evaluated as if there were 
direct effects, with researchers unable to specify the precise nature of the so- 
cial and economic effects on fertility as they operate through the intermediate 
variables. 

4. Note, however, that Voss (1977) finds a negative effect of age at first 
marriage on educational attainment. Marini (1978) argues, and we agree, that 
this finding of Voss is the result of the lack of an adequate instrument for age 
at first marriage. 

5. There is some evidence that a history of miscarriage greatly increases 
the chance that subsequent conceptions will be terminated by a miscarriage 
(Funderburk et al., 1976; Shapiro et al., 1971). Given the unreliability with 
which fetal losses are reported in pregnancy histories (Bumpass and Westoff, 
1970) and given the fact that very early miscarriages are often unnoticed by 
the woman, we experimented with alternative and more complex measures 
of fecundity which incorporated information from the woman’s history 
subsequent to the first birth. However, the simple measure of whether or not the 
woman had a miscarriage prior to the first birth proved to be the strongest 
predictor of age at first birth, and this is the measure that has been used in the 
final models. 

6. Other nonwhites were not included. 

7. Age at leaving school was computed by assuming a normal starting 
age, and assuming that education is obtained one year at a time. 

8. To further explore this issue, and to explore whether a gating mech- 
anism existed, we reran the two-stage least squares analysis for women who 
became mothers at a young age. Although caution is necessary in inter- 
preting such an analysis because the variance of the endogenous variables 
has been reduced, age at first birth does not have a significant effect on 
education. 

9. It should be noted, however, that it is possible that, for some women, 
short interbirth intervals prevent the return to school. Virtually nothing is 
known about returning to school after becoming a mother, although there has 
been some research on education after marriage. Approximately one in five 
women attend school after marriage; but the average addition to their educa- 
tional attainment is relatively small: 1.0 years (Davis and Bumpass, 1976). 
Whether this schooling takes place before or after the start of childbearing 
is unknown. In order to minimize the possibility of education after the first 
birth being affected by the pace of fertility, we have primarily used education 
at marriage (rather than education at interview) for this analysis. 

10. We follow the standard convention of indexing birth intervals by the 
order of the fertile pregnancy terminating the interval. Thus, the second birth 
interval is the interval terminated by the second fertile pregnancy. 

11. The results in Table 3 are based on all birth intervals. Thus, both 
wanted or intended intervals and unwanted or unintended intervals are in- 
cluded. To make sure that the relationships shown in Table 3 were not the 
result of differences in fertility intentions, we calculated a set of life ta- 
bles for “intended” intervals, excluding the following two types of intervals: 
(a) closed intervals that were closed by an unwanted birth, and (b) open inter- 
vals where the respondent indicates she does not intend to have another child. 
These results (not shown) are virtually identical to those shown in Table 3. 
Also, in order to see if the finding was sensitive to the particular measure 
of education used, we reran the analysis using respondent’s education at in- 
terview, and then we reran it again using respondent’s husband’s education 
at respondent’s first marriage. These alternative analyses lead to the same 
conclusions. 

12. It should be noted that there is little variance in fertility prefer- 
ences. Three-fourths of the sample gave a preference of 2, 3 or 4. This, of 
course, reduces the possibility of any variable significantly affecting fertility 
preferences. 


Abstract 


While the possible decline in the level of social capital in the United States 
has received considerable attention by scholars such as Putnam and 
Fukuyama, less attention has been paid to the local activities of citizens that 
help define a nation’s stock of social capital. Scholars have paid even less 
attention to how institutional arrangements affect levels of social capital. 
We argue that giving parents greater choice over the public schools their 
children attend creates incentives for parents as “citizen/consumers” to en- 
gage in activities that build social capital. Our empirical analysis employs 
a quasi-experimental approach comparing parental behavior in two pairs of 
demographically similar school districts that vary on the degree of parental 
choice over the schools their children attend. Our data show that, controlling 
for many other factors, parents who choose when given the opportunity are 
higher on all the indicators of social capital analyzed. Fukuyama has argued 
that it is easier for governments to decrease social capital than to increase 
it. We argue, however, that the design of government institutions can create 
incentives for individuals to engage in activities that increase social capital. 


The delivery of services by local governments involves a complex relation- 
ship between the institutions that supply them and the citizens who use them. 
To improve the delivery of public services, many reformers argue that gov- 
ernments should imitate private markets by increasing the number of suppli- 
ers and by “empowering” citizens to shop across this expanded choice set. 
In this model, “citizen/consumers” become better consumers of public ser- 
vices by becoming more informed about their options and by more carefully 
selecting services that meet their preferences. 

We suggest that the benefits of such market-like reforms can extend be- 
yond the consumer behavior that has been the focus of previous analysis. 
Specifically, we argue that by expanding the options people have over public 
services, citizen/consumers can also become better citizens, and by so doing, 
increase the nation’s stock of social capital. We test this hypothesis in the 
context of public school choice—a set of reforms that increases the control 
parents have over the selection of schools their children attend. These re- 
forms are of long standing in some communities and are emerging in many 
others. In this research, we show that the design of public institutions charged 
with delivering education can affect the formation of social capital. 


Social Capital and Local Citizenship 


An intense scholarly debate recently has emerged concerning the role of social 
capital in economic and political development (e.g., Brehm and Rahn 
forthcoming; Fukuyama 1995; Granato, Inglehart, and Leblang 1996a, 1996b; 
Inglehart 1990; Jackman and Miller 1996a, 1996b; Lipset 1995; Putnam 
1993, 1995a, 1995b; Swank 1996; Tarrow 1996).¹ One theme in this debate 
is that social capital may be important to strong democracies for the 
same reasons that it is important for the functioning of strong economies: 
High levels of social capital engender norms of cooperation and trust, reduce 
transaction costs, and mitigate the intensity of conflicts. 

While political scientists have only recently adopted the concept of so- 
cial capital, the term has been used by sociologists for some time (see, e.g., 
Bourdieu 1980, Loury 1977). Coleman (1988, 1990) brought the term into 
wider circulation and argued (1988, S101) that social capital is generated as 
a byproduct of individuals engaging in forms of behavior that require socia- 
bility. In his study of 20 subnational governments in Italy, Putnam (1993) 
argued that the quality of governance is determined by the level of social 
capital within a region. Fukuyama concurs (1995, 356): 


“The ability to cooperate socially is dependent on prior habits, traditions, and 
norms, which themselves serve to structure the market. Hence it is more likely 
that a successful market economy, rather than being the cause of stable 
democracy, is codetermined by the prior factor of social capital. If the latter is 
abundant, then both markets and democratic politics will thrive, and the mar- 
ket can in fact play a role as a school of sociability that reinforces democratic 
institutions.” 


While comparisons across nations and the identification of trends over 
time are obviously important, less scholarly work has focused on how gov- 
ernment policies affect the stock of social capital. This is especially true 
for the analysis of the formation of social capital at the local level, where a 
small but growing body of work has developed addressing the link between 
government policies and social capital. Stone and his colleagues have been 
examining the role of “civic capacity,” a concept similar to social capital, in 
local economic development and the politics of education (see, e.g., Stone 
1996). Berry, Portney, and Thomson (1993) examined the importance of 
local community activity in the formation of social capital. And, in the con- 
text of education, Astone and McLanahan (1991), Coleman and Schneider 
(1993), and Lee (1993) have examined social capital as a function of the 
interactions among administrators, teachers, parents, and children. 

We follow the approach of Berry, Portney, and Thomson, who empha- 
size the importance of communities where neighbors talk to each other about 
politics. In these face-to-face meetings, these authors argue that “democracy 
moves politics away from its adversarial norm, where interest groups square 
off in conflict and lobbyists speak for their constituents. Instead, the bonds 
of friendship and community are forged as neighbors look for common so- 
lutions to their problems” (1993, 3). (Also see Mansbridge 1980 on “uni- 
tary democracy” and Barber 1984 on “strong democracy.”) Berry, Portney, 
and Thomson’s emphasis on “face-to-face” interactions parallels Fukuyama’s 
(1995) focus on “spontaneous sociability” and Putnam’s (1993) emphasis on 
the role of networks and membership in voluntary and social organizations 
as supports for representative democracy (see also the review by Diamond 
1992). 

In this article, we go beyond documenting levels of social capital by 
identifying the effects of institutional arrangements governing the delivery 
of education, the most important public good local governments provide, on 
the formation of social capital. Whereas scholars have recognized the impor- 
tance of schools in creating social capital for the next generation (see, e.g., 
Henig 1994, 201-3), for us, schools are also arenas in which social capital 
can be generated among today’s parents. 

We explore the relationship between schools and social capital by con- 
sidering how school choice can influence parental behavior. Specifically, 
we examine how school choice may increase levels of voluntary parental 
involvement in the schools, face-to-face discussions between parents, and 
levels of parental trust in teachers—behaviors that have all been identified 
as components of social capital. We test these relationships empirically us- 
ing a quasi-experimental design that allows us to isolate the link between 
school choice and citizen behavior. Fukuyama has argued that “social capital 
is like a ratchet that is more easily turned in one direction than another; it can 
be dissipated by the actions of governments much more readily than those 
governments can build it up again” (1995, 62). We show that institutional 
arrangements that increase parental control over the schools their children 
attend may be able to reverse that ratchet. 

Some scholars are skeptical that government policies expanding choice 
can increase social capital. For example, Anderson argues that expanded citizen 
choice, at best, will cultivate only a “passive understanding” of the demands 
of democratic participation and that this “consumer’s skill” is not a 
sufficient basis for “competent citizenship” (1990, 197-8). Carnoy (1993, 
187) and Henig (1994, 222) both argue that school choice will increase the 
social stratification between parents who are more involved and interested in 
their children’s education and those who are not, fundamentally reducing the 
ability of communities to address collective problems. And Handler (1996, 
185) notes that while choice plans require parents to choose, they cannot 
force parents to become actively engaged in school activities. 


406 REPRINTED FROM THE AMERICAN POLITICAL SCIENCE REVIEW 


In contrast, other scholars argue that choice and related reforms will fos- 
ter social capital. As Ravitch (1994, 9) notes: “The act of choosing seems to 
make parents feel more responsible and become more involved.” And Berry, 
Portney, and Thomson (1993, 294) cite the shift to parental control over local 
schools in Chicago in the late 1980s as a rare example of a successful 
attempt to get low-income parents more involved in local public affairs (also 
see Handler 1996). 

In the analysis that follows, we show that reforms introducing choice 
can affect the level of social capital within communities. While our findings 
are limited to one particular aspect of local communities—schools—they 
provide important evidence that government or community-initiated policies 
can indeed ratchet up the preexisting levels of social capital and enhance 
the social fabric necessary for building and maintaining effective democracy. 
And, we demonstrate that this can be done both in suburban communities, 
where most Americans now live, and in inner-city neighborhoods, where the 
stock of social capital may be most depleted and where its absence may have 
the most deleterious effects (e.g., Berry, Portney, and Thomson 1993; Wilson 
1987). 


School Choice 


School choice is perhaps the most widely discussed approach to address- 
ing persistent problems in primary and secondary education in the United 
States. School choice advocates, liberals and conservatives alike, contend 
that changing the institutions governing school organization will improve 
student performance by changing the incentives faced by educators and by 
changing the behavior of students and parents (see Handler 1996, 9).² 

It is possible to define school choice in such a way that it is already 
the norm. Many families already use residential location to choose the pub- 
lic schools their children attend. Even after the residential decision is made, 
many private alternatives to public education are available and about 10% 
of parents nationwide choose that option. School choice, however, is typi- 
cally construed to involve policies that reduce the constraints that traditional 
public schooling arrangements place on schools and students. (For a discus- 
sion of distinctions among choice approaches, see Witte and Rigdon 1993.) 
Most important, school choice policies are designed to break the one-to-one 
relationship between residential location and the schools students attend. 

Responding to intense policy debates and the growing recognition of the 
problems of American schools, over the past two decades a growing number 
of local school districts have changed the institutional frameworks governing 
the provision of local education, giving parents expanded choice over 
the schools their children attend. We take advantage of this diffusion of the 
innovation in school choice policy, employing a quasi-experimental approach 
comparing parental behavior in two pairs of school districts that are demo- 
graphically similar but vary on institutional arrangements. We analyze the 
effects of choice on the formation of social capital in a matched pair of inner- 
city school districts, one with a long history of extensive choice and one 
without much choice. We then replicate this analysis in two suburban school 
districts. In each matched pair, the populations are similar demographically, 
but the institutional arrangements allowing parental choice over the schools 
their children attend differ. 

Our analysis is based on interviews of approximately 300 parents of 
children in public school grades K-8 across four districts. (Appendix A de- 
scribes the sample design.) Two of these are inner-city districts in New York 
City: District 1, which has only recently introduced limited choice, and Dis- 
trict 4, which has offered programs of choice for 20 years. The other two 
are suburban communities in New Jersey: Morristown, which strictly main- 
tains assignment to neighborhood schools, and Montclair, which has had a 
program of choice since the 1970s. 

We begin with a discussion of the two New York school districts, de- 
scribing in detail the evolution of choice in District 4. We then present an 
empirical analysis of effects of choice on social capital in the New York set- 
ting. Finally, we replicate the analysis using our New Jersey sample. 


District 4: A School Choice Innovator 


District 4 is located in East or “Spanish” Harlem, one of the poorest com- 
munities in New York City. The district serves roughly 12,000 students from 
pre-kindergarten through the ninth grade. In the early 1970s, the district’s 
performance was ranked the lowest of 32 city public school districts in math 
and reading scores. Choice was part of a response to this poor performance. 

Fliegel (1990) described the evolution of school choice in District 4 
as resulting from “creative noncompliance” with New York City rules and 
regulations. The factors shaping the emergence of District 4 can be 
traced back to the late 1960s when the administration of New York City’s 
public school system was decentralized to allow for greater community con- 
trol. Thirty-two separate community school districts were established, each 
of which was governed by an elected community school board and by the 
central Board of Education. High schools remained under the authority of 
the Board of Education. Decentralization was supposed to promote greater 
parental participation, but it has also led to problems with corruption, 
overpoliticization, and poor performance (Cookson 1994, 50-1). 

District 4 took full advantage of decentralization, in large part due to 
the entrepreneurial efforts of Anthony Alvarado, district superintendent from 
1972 until 1982. As Boyer (1992, 41-2) notes, Alvarado bent rules, attracted 
outside grants, and won support from powerful teacher and principal unions. 
When Alvarado took over as superintendent, District 4 ran 22 schools in 22 
buildings. In 1974, the first alternative school, Central Park East Elementary, 
was developed, followed by an alternative program for seventh and eighth 
graders with serious emotional and behavioral problems and by the East 
Harlem Performing Arts School, a program for fourth through ninth graders. 
These schools were open to parental choice and, as minischools, they were 
located within existing buildings where space was available. These schools 
were given greater flexibility over staffing, use of resources, organization of 
time, and forms of assessment. 

The differences between the administration of these alternative schools 
and the traditional schools led to complaints of favoritism from some teachers 
and principals in the traditional schools. In response, new opportunities were 
offered to develop alternate schools using funding from the Magnet Schools 
Assistance Act (Wells 1993, 56). The district also exceeded its annual budget 
for many years as these alternative schools were being developed (Henig 
1994, 164). 

The focus on educational goals was shaped by Seymour Fliegel, appoin- 
ted District 4’s first director of alternative schools in 1976, who developed 
small schools designed to provide students, parents, and professional staff 
with flexibility and a sense of school “ownership” (Fliegel 1990, 209). Fliegel 
also used choice to encourage this sense of ownership. During the late 1970s 
and the 1980s more than 20 alternative schools were developed, many with 
distinctive curricular themes. As the number of schools increased, the differ- 
ences between schools became more apparent. With many new schools and 
the potential for parents and students to make meaningful choices, Smith and 
Meier (1995, 94) suggest that it “became hopeless” to tell parents or teachers 
that their assignments would be determined bureaucratically. Thus, in 1982, 
the district decided to provide all parents with choice. Sixteen neighborhood 
elementary schools remained intact, with space reserved first for those living 
in the designated zones. While the emphasis was placed on providing choice 
at the junior high school level, the district also created a considerable number 
of alternative elementary schools, many of them bilingual (Smith and Meier 
1995, 94). 

In District 4, all students must make an explicit choice about the 
junior high school they will attend. Each sixth-grader receives a copy of a 
booklet describing the alternative junior high schools. Parents and students 
attend orientation sessions led by the directors of various alternative schools 
and are encouraged to visit the schools (Wells 1993, 55). Students and 
their parents rank and discuss their six choices of junior high schools. Sixty 
percent of the students in the district are accepted into their first-choice 
school, 30% into their second-choice school, and 5% into their third-choice 
school. The remaining 5% are placed in schools thought to be most appro- 
priate for them (Boyer 1992, 52-3). To ensure that all students have viable 
choices, District 4 administrators monitor the popularity of the various alter- 
native schools, closing or restructuring less popular schools (Wells 1993, 55). 


District 1: Limited Choice 


Our other New York City research site is District 1 on Manhattan’s Lower 
East Side. Largely Hispanic and poor, the residents of District 1 share many 
characteristics with those of District 4. District 1 was created out of the Two 
Bridges School District, one of the most active districts in New York City’s fights 
over school decentralization in the 1960s. Despite this high initial level of 
community activism, the schools have foundered over the years. Following 
the success of District 4, District 1 began experimenting with school choice, 
and in 1992 created a small number of alternative schools.* 

As a result of entrepreneurial efforts to develop choice, District 4 has developed 
a reputation in the city and in the nation as an innovative, successful 
district. A sense of mission is evident among parents, teachers, and administrators. 
While there is some dispute about how much of the success can 
be attributed to choice per se (see Henig 1994, 124-44), there is no question 
that performance in District 4 improved from its original low level as choice 
was implemented. In contrast, despite the high level of community activi- 
ties during the push for decentralization, District 1 has faced considerable 
administrative turnover and turmoil for the last few years. 

We report some comparative data on the districts in Table 1. Both dis- 
tricts are geographically compact, have large numbers of students from very 
poor families (more than eight of ten students are eligible for free lunches), 
and have a majority Hispanic student population. 

Table 1. District 4 and District 1 Population and Sample Demographics 

                                 District 4             District 1
                              Population   Sample    Population   Sample

Number of students                13,806      333        12,519      295
Number of schools                     50       46            24       24
Hispanics                            63%      68%           63%      71%
Blacks                               33%      26%           12%      11%
Whites                                2%       2%           10%      10%
Asian                                 1%       1%           13%       2%
Percentage in poverty                54%       NA           49%       NA
Income < $20,000 per year             NA      67%            NA      66%
Employed                             35%      38%           48%      43%
High school degree or more           48%      65%           63%      65%
Single parent                         NA      61%            NA      46%
Female                               50%      90%           55%      87%

Source: For district information: School District Data Book Profiles. 1989-90. 
NA: Since both districts are administrative units for the New York City school system rather 
than, e.g., census designated units, some demographic data are not available. 


The Survey Respondents 


We contracted Polimetrics Laboratory for Political and Social Research, a 
survey research facility at Ohio State University, to interview 400 residents 
in each district in spring 1995, sampling parents (or the person in a household 
who “makes the decisions about the education of children”). To focus on the 
schools controlled by the districts, the sample frame was limited to parents 
with children in grades K-8.⁵ To randomize, respondents were asked to answer 
school-specific questions based on the experience of their child in grades 
K-8 whose birthday came next in the calendar year. 
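
The next-birthday rule can be sketched in a few lines of Python. This is only an illustration, not the survey firm's procedure: the function names, the record layout, the coding of kindergarten as grade 0, the handling of February 29 birthdays, and the interview date are all assumptions introduced here. "Next in the calendar year" is read as the first birthday falling on or after the interview date.

```python
from datetime import date

def days_until_birthday(birthday: date, today: date) -> int:
    """Days from `today` to the next occurrence of `birthday` (0 if today)."""
    def occurrence(year: int) -> date:
        try:
            return birthday.replace(year=year)
        except ValueError:
            # Feb 29 birthday in a non-leap year: use Mar 1 (an assumption).
            return date(year, 3, 1)
    upcoming = occurrence(today.year)
    if upcoming < today:
        upcoming = occurrence(today.year + 1)
    return (upcoming - today).days

def next_birthday_child(children, today):
    """Pick the K-8 child whose birthday comes next after the interview date."""
    eligible = [c for c in children if 0 <= c["grade"] <= 8]  # K coded as 0
    if not eligible:
        return None
    return min(eligible, key=lambda c: days_until_birthday(c["birthday"], today))

# Hypothetical household interviewed on 15 April 1995.
household = [
    {"name": "A", "grade": 2, "birthday": date(1987, 3, 30)},
    {"name": "B", "grade": 6, "birthday": date(1983, 6, 2)},
    {"name": "C", "grade": 10, "birthday": date(1979, 4, 20)},  # not in K-8
]
chosen = next_birthday_child(household, date(1995, 4, 15))
print(chosen["name"])
```

Because birthdays are unrelated to school experience, a rule of this kind picks one child per household without letting the respondent choose which child to report on.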

As Table 1 illustrates, the sample of public school parents in each district 
is fairly representative of the student population on many key demographic 
variables. (We chose to interview parents of children who live in the districts 
but attend private schools as these parents are exercising a form of choice. 
However, they are not included in the analyses presented below. In District 1, 
26% of the respondents sent their child to private school, compared to 17% 
in District 4.)⁶ 
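
One way to check the representativeness claim against Table 1 is to compute, for each variable reported for both the population and the sample, the gap in percentage points. The sketch below does this with the Table 1 figures; the dictionary layout and the helper name `gaps` are illustrative choices made here, not part of the original analysis.

```python
# Percentages from Table 1 as (population, sample); rows with NA are omitted.
TABLE_1 = {
    "District 4": {
        "Hispanics": (63, 68), "Blacks": (33, 26), "Whites": (2, 2),
        "Asian": (1, 1), "Employed": (35, 38),
        "High school degree or more": (48, 65), "Female": (50, 90),
    },
    "District 1": {
        "Hispanics": (63, 71), "Blacks": (12, 11), "Whites": (10, 10),
        "Asian": (13, 2), "Employed": (48, 43),
        "High school degree or more": (63, 65), "Female": (55, 87),
    },
}

def gaps(district):
    """Sample minus population, in percentage points, for each variable."""
    return {var: s - p for var, (p, s) in TABLE_1[district].items()}

for district in TABLE_1:
    print(district)
    # List the variables from the largest absolute gap to the smallest.
    for var, gap in sorted(gaps(district).items(), key=lambda kv: -abs(kv[1])):
        print(f"  {var:<28}{gap:+d}")
```

On these figures the largest gap in both districts is the share of women.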

Overwhelmingly, we sampled females, both because there are many sin- 
gle mothers in these districts and because we asked to speak with the person 
in the family who makes the decisions about school. More than 60% of the 
households were headed by a single parent in District 4, compared to 46% 
in District 1, and in both districts, more than 85% of the respondents were 
female. 


Constructing the Models 


With this background in place we now turn to our major goal: to assess the 
degree to which giving parents more control over the schools their children 
attend increases their level of social capital. In our analysis we use four 
measures of social capital, three of which are directly derived from Putnam 
(1993) and Fukuyama (1995) and the fourth a logical extension. 

The first measure is whether the parent is a member of the PTA. Putnam 
uses declining participation in PTAs as one of his indicators of the erosion of 
social capital.⁷ Second, we analyze a slightly broader measure of parental 
involvement in the schools, asking parents if in the past year they had engaged 
in any volunteer activities for their child’s school. The third measure we in- 
vestigate is the number of other parents our respondent talked with about 
school matters. We use this measure to reflect the “spontaneous sociabil- 
ity” Fukuyama emphasizes as underlying social capital and the importance of 
“face-to-face democracy” emphasized by Berry, Portney, and Thomson 
(1993). Our final measure reflects the level of trust parents have in their 
child’s teacher to do the “right thing” for their child.⁸ For Fukuyama the 
general level of trust in society is the critical dimension of social capital, since 
it lubricates economic, political and social transactions. In this research, we 
concentrate on a single domain-specific dimension of trust (trust in teachers). 
These activities not only are central to building social capital, they are also 
critical to building good schools (see, e.g., Anson et al. 1991). 

In our selection of independent variables, we measure elements of mo- 
tivation, resources, time constraints, and school policies that Kerbow and 
Bernhardt (1993, 116) argue are critical features of parental involvement in 
the schools. Thus we employ variables related to individual demographic 
characteristics as well as those related to the schools children are attending. 

Three different types of institutional arrangements exist in the two cen- 
tral city districts in our study. The oldest and most traditional form of school 
organization is the neighborhood model, in which children are assigned to 
schools based on residential location. The second is universal choice, which 
characterizes the intermediate school system (grades 6-8) in District 4. Un- 
der this type of arrangement all parents must choose a school for their chil- 
dren (i.e., there is no “default” school). Finally, an “option demand” sys- 
tem of choice (see Elmore 1991), which exists in both districts but is much 
more developed in District 4, allows parents to select a school other than their 
neighborhood school. We refer to those parents who have decided to exercise 
choice as “active choosers.” About 20% of our sample fall into the universal 
choice category (all in District 4), while about 9% of all of the sampled par- 
ents in New York are active choosers. 

Active choosers present us with the same fundamental problem faced 
by any research on the behavior of parents in school choice settings—parents 
choosing alternative schools may not be a random selection of all parents in 
a school district. And, if parents who self-select alternative schools are also
high on social capital, then our results will be biased. While other studies have
acknowledged this problem and made various efforts to control for selection 
bias (Chubb and Moe 1990; Coleman and Hoffer 1987; Coleman, Hoffer 
and Kilgore, 1982; Smith and Meier 1995), we correct for it by construct- 
ing a two-stage nonrandom assignment model, in which the first equation 
models the assignment process and the second equation the “outcome.” The 
method, described in Appendix B and based on the work of Heckman (1978), 
Heckman, Hotz, and Dabos (1987) and Lord (1967, 1969), corrects for both 
the nonrandom selection process and other econometric problems associated 
with the use of dichotomous dependent variables (see Achen (1986) and Al- 
varez and Brehm (1994) for discussions of the applicability of this method in 
political science).⁹

By limiting the possibility that parents likely to make active choices 
are also likely to engage in other activities that we refer to as part of social 
capital, the use of this methodology is critical to our argument that making 
an active choice influences parental behavior. 

As noted in detail in Appendix B, we begin with an explicit assignment 
equation: 


Active choosers = α + β[Demographics] + β[Values]
                  + β[Diversity] + error,                          (1)


where Active choosers is a dichotomous variable indicating whether a par- 
ent has elected an alternative school or program for their child (1 = yes, 
0 = no); Demographics is a vector consisting of a set of dummy variables 
for self-identified racial group membership (black, Hispanic, Asian—white 
is the excluded category), a continuous variable measuring years of schooling 
of the parent, a continuous variable reflecting the length of residence in the 
school district, and a 7-point scale measuring frequency of church attendance 
(1 = never, 7 = once per week). We also include two dichotomous variables 
reflecting the gender of the respondent (1 = female) and whether or not the 
respondent is employed (1 = yes). The racial, gender, and employment vari- 
ables reflect the resources and demographic factors that may influence ac- 
tivities related to social capital. Parental education level may be particularly 
important—Putnam (1995b, 667) reports that it “is by far the strongest corre- 
late... of civic engagement in all its forms.” The length of residence variable 
reflects the argument advanced by Brehm and Rahn (forthcoming) and by 
Putnam, who both argue that mobility decreases social capital. In addition, 
Teske et al. (1993) found that length of residence affected knowledge of 
school policies. Church attendance is a control variable representing an alternative
form of interaction and involvement with the local community.

The Values and Diversity variables indicate whether a parent thought 
either particular values or diversity as school attributes were important in 
their choice of schools. In our survey parents were asked to name up to four 
attributes they thought were most important in a school. Two attributes in 
particular, the values espoused by the school and the diversity of the student 
body, were considered important by parents of children in alternative public
schools but not by parents of children in neighborhood public schools.¹⁰
We therefore include these variables in the assignment equation for theoret- 
ical reasons, as they are important predictors of active school choosers. We 
have no theoretical reason, however, to expect these variables to affect social
capital and, indeed, they are not empirically related to the activities we have
measured. These are used as exclusions in our outcome equation and provide
the necessary leverage for estimating the system of equations.¹¹

Thus, as described in greater detail in Appendix B, we estimate this 
assignment equation and the predicted value of the active chooser variable is 
used in estimating the following outcome equation: 


Social capital = α + β[“Predicted” active choosers]
                 + β[School factors] + β[Demographics] + error,    (2)


where Demographics are as noted in equation (1) and Values and Diversity 
are excluded. “Predicted” active choosers denotes the estimated values from equation
(1), transformed into a linear functional form following Goldberger (1964;
also see Achen 1986, Heckman 1978). School factors measure other as- 
pects of the school environment. These factors include a variable measur- 
ing the enrollment in the school the child attends, as smaller schools are 
often considered to be better arenas for building social capital (Harrington 
and Cookson 1992); a dummy variable (= 1) when the respondent had
made a universal choice at the junior high level in District 4; and a measure
of parental dissatisfaction with her child’s school.¹² Previous research
(e.g., Witte 1991) has demonstrated that parental dissatisfaction is negatively 
correlated with levels of parental involvement and participation in school 
activities. 

When the dependent variable in the outcome equation is continuous, as 
in our analysis of the number of parents with whom a respondent has talked 
about schools, the two-stage estimation technique is fairly straightforward. 
When the dependent variable is a dichotomous variable, however, another 
round of corrections is necessary because the disturbances are heteroskedas- 
tic (see Appendix B; also see Achen 1986, 40-7). In our analysis of the other 
three measures of social capital we report these generalized two-stage least
squares (G2SLS) results. Note that since the results are generalized linear 
probability estimates, the coefficients have a straightforward interpretation: 
They represent the change in the probability of finding an event given a unit 
change in the independent variable. 


The Effects of Choice in the Central City 


With these corrections in place, we are now able to estimate the effects of 
school choice on the behavior of parents controlling for the nonrandom “assignment”
across alternative schools.¹³ We present the results in Table 2.
Turning first to PTA membership, reported in the first column, we find strong 
evidence that school choice affects this widely used measure of social capi- 
tal: Ceteris paribus, participation in the PTA among active choosers is 13% 
higher than among nonchoosers (p < .05), the largest effect in our model, 
apart from gender. 

The effects of some other variables are worth noting. First, note that as 
the length of residence increases, so does participation in the PTA (p < .05, 
using a one-tail test). Similarly, frequency of church attendance increases 
participation in the PTA. These findings confirm empirically the arguments 
presented by Putnam and Fukuyama, as well as findings by education re- 
searchers (Kerbow and Bernhardt 1993, Muller and Kerbow 1993). Note too 
that participation in the PTA increases with the level of parental education— 
individual human capital and social capital flow together. 

In the second column of Table 2, we turn to more general patterns of par- 
ticipation in voluntary events. Here we find that active choosers are over 12% 
more likely to engage in such activities than are nonchoosers. Paralleling 
the results reported for PTA membership, church attendance and longer resi- 
dence are associated with volunteering, as is more years of parental education. 

We have shown that active participation in school choice increases lev- 
els of involvement with voluntary organizations. We turn next to a measure 
of “spontaneous sociability” —how many other parents do our respondents 
engage in discussions about schools? The same cluster of variables emerges 
as important: Ceteris paribus, active choosers talked with four more parents 
than nonchoosers (see the third column of Table 2). Again, longer term res- 
idents, more educated respondents, and frequent churchgoers talk with more 
parents than do other respondents. 

Finally, we examine trust in teachers. As shown in the final column of 
Table 2, school factors dominate this model. Active choosers are almost 10% 
more likely to trust teachers all or most of the time and universal choosers are 
9% more likely to do so. In contrast, parents who are dissatisfied with their 
Table 2. The Effects of Choice on the Formation of Social Capital in Two
New York Districts


                       PTA          Voluntary     Parents       Trust
                       Member       Activities    Talked to     Teacher

Active chooser         .128*        .123*         4.053*        .095*
                       (.064)       (.064)        (2.295)       (.049)
Universal choice       -.035        .025          -.613         .096*
                       (.066)       (.062)        (.651)        (.056)
Dissatisfaction        -.042        -.003         .234          -.239**
                       (.041)       (.040)        (.404)        (.039)
School size            -.000        -.000         -.000         .000
                       (.000)       (.001)        (.001)        (.000)
Black                  .092         .048          -.401         -.057
                       (.072)       (.068)        (1.30)        (.044)
Hispanic               -.068        -.021         .419          -.066
                       (.066)       (.062)        (1.22)        (.036)
Asian                  .041         .149          1.61          .059
                       (.187)       (.157)        (2.47)        (.097)
Length of residence    .005*        .005*         .085**        -.002
                       (.003)       (.003)        (.030)        (.002)
Education              .015**       .020**        .148*         -.009*
                       (.005)       (.006)        (.063)        (.004)
Employed               -.046        .031          .038          .033
                       (.044)       (.042)        (.427)        (.029)
Female                 QI           .110          .370          -.052
                       (.056)       (.067)        (.708)        (.036)
Attend church          .041***      .023""*       .242*         .010
                       (.009)       (.009)        (.108)        (.006)
Constant               .336"        .327          .739          1.05
                       (.129)       (.135)        (2.34)        (.090)

                       N = 580      N = 580       N = 568       N = 578
                       F = 66       F = 107       F = 4.4       F = 4.3


Note: Numbers in parentheses are adjusted standard errors. We do not report
R-squared statistics because in the adjustment process necessary to correct for
the nonrandom assignment problem, this statistic becomes inappropriate (see
Aldrich and Nelson 1994, 14-5).

* p < .05; ** p < .01; *** p < .001
child’s school and have considered moving the child to a different school are
24% less likely to trust their child’s teachers. Of the demographic factors, 
only education is related to trust—but this relationship is negative. 

Note also that while choosing significantly increases social capital on 
all four dimensions we measure, school size is not related to any of these 
measures. Harrington and Cookson (1992) have argued that the introduction 
of smaller schools in District 4 was the most important innovation accounting 
for the improvements found in the district. Our results differ—it is choice and 
not school size that matters. 


Taking Advantage of the Quasi-Experimental Design: Replicating 
the New York Findings 


Replication is one of the most powerful tools available for validating so- 
cial scientific findings. In the next stage of our analysis, we take advantage 
of our quasi-experimental design to replicate the results of our New York 
study in another pair of school districts. This replication allows us to explore 
the robustness of our findings by testing their sensitivity to changes in the 
context of choice. In our next comparison, we explore the effects of com- 
munity composition on our findings. In our first analysis, we demonstrated 
that school choice fosters behavior that builds social capital among parents 
in low-income central city school districts. Given the multitude of problems 
facing central cities, this is obviously an important finding. The next ques- 
tion is obvious: Does this relationship hold among suburban parents who 
now make up a larger share of the American population than do those in the 
central city? 

Second, and more important for us, the institutional factors that define 
the extent of school choice vary across our two sets of communities. In our
next “experiment,” we compare patterns of activities in a traditional neigh- 
borhood school district (where no one can choose a school except by chang- 
ing their residential location or by opting out of the public sector altogether) 
with those in a universal choice district (where there are no neighborhood 
schools). These institutional arrangements represent more extreme points on 
the policy continuum than do those in District 1 and District 4. Are the re- 
sults we found in New York replicated under these different community and 
institutional conditions? Is the magnitude of the effects similar?


School Choice and Social Capital in Suburban Communities 


To answer these questions we turn to our second paired set of communities, 
Montclair and Morristown, New Jersey, two suburban communities within 
commuting distance of New York City. Given the institutional arrangements 
governing the schools in these two districts, we can test the effects of universal
choice directly, since everyone in Montclair’s public schools chooses and
no one in Morristown’s can. 


Montclair and Morristown, New Jersey 


In both communities, court-ordered desegregation decisions in the 1970s 
led to fundamental changes in the school assignment mechanisms; however, 
very different responses were developed to achieve racial balance. Montclair 
adopted school choice, with parents given the right to choose schools from 
kindergarten through the eighth grade (there is only one high school), with 
choice constrained by racial balancing. In Morristown, residential zones 
were created for neighborhood schools. These zones are frequently adjusted 
so that each school in each zone has the same racial balance, but once set the 
zones are strictly enforced. 

School choice has been operating in Montclair for about as long as in 
District 4. In 1969, the New Jersey Commissioner of Education ordered 
Montclair to desegregate or lose state funding. A forced busing plan was im- 
plemented in 1972, which caused conflict and considerable white flight. A 
limited choice program was implemented in 1975 to try to encourage vol- 
untary racial balancing by establishing magnet schools. Several changes 
were made to the choice plan in Montclair, and in 1984 choice was intro- 
duced to the whole district by the symbolic act of turning all schools into 
magnets. 

While choice was initially a solution to racial balancing, parents, teach- 
ers, and administrators used it to promote competition and better schools 
(Boyer 1992, 33). Parents in Montclair are provided with considerable infor- 
mation about the schools. In choosing schools, parents request two options 
and students are placed in their first choice if it matches the racial balanc- 
ing goals. The schools are nearly uniformly good and about 95% of parents 
receive their first choice (Strobert 1991, 56-7). Between 60 and 80% of stu- 
dents are bused to their schools, but now such busing is voluntary. 

Table 3 shows the demographics of the public school parents in these 
two New Jersey districts, overall and for our surveyed sample of 400 parents 
in each community. 

Under the universal system of choice in Montclair, all parents are re- 
quired to choose a school for their child. Therefore, it is not necessary to 
specify the selection process as we did for the analyses of our New York City 
parents—that is, no assignment equation is needed and the extensive correc- 
tions noted in Appendix B are not necessary. Thus, the results reported in 
Table 4 are the results of straightforward multivariate analyses. For compa- 
rability with the linear probabilities reported in our analysis of New York, 
Table 3. Montclair and Morristown Population and Sample Demographics


                                    Montclair                Morristown

                               Population    Sample     Population    Sample

Number of students                5850         356         5080         286
Number of schools                   10          10            9           9
Hispanics                           4%          3%           9%          7%
Blacks                             36%         34%          17%         16%
Whites                             56%         57%          70%         70%
Asian                               3%          1%           4%          5%
Percentage in poverty               7%          NA           6%          NA
Income < $20,000 per year          16%          8%          21%         14%
Employed                           59%         80%          58%         71%
High school degree or more         88%         98%          86%         94%
Single parent                      11%         23%          23%         22%
Female                             54%         78%          53%         76%


Source: For district information, School District Data Book Profiles, 1989-90.


we report the percentage point change for a unit change in the independent 
variable (for the dummy variable, this is the effect of having the character- 
istic [1] versus not having it [0]). Since all Montclair parents must choose 
their children’s school and no one in Morristown public schools can choose 
(except by moving), the coefficient of the dummy variable for Montclair rep- 
resents the effects of universal choice, ceteris paribus. 
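The percentage point change for a dummy variable in a probit equation can be sketched as the difference between two normal probabilities. This is a minimal illustration, not the authors' computation: the baseline index of 0 (a 50% baseline probability) is an assumption, since the article does not state the values at which the other variables were held, and the function names are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def dummy_effect(baseline_index, coefficient):
    """Change in predicted probability when a dummy switches from 0 to 1.

    baseline_index is the probit index (x'b) with the dummy set to 0.
    """
    return normal_cdf(baseline_index + coefficient) - normal_cdf(baseline_index)


# Assumed baseline: index 0, i.e. a 50% chance of PTA membership.
# 0.35 is the universal choice coefficient in the PTA equation of Table 4.
effect = dummy_effect(baseline_index=0.0, coefficient=0.35)
print(f"{100 * effect:.1f} percentage points")
```

Under this assumed baseline the change is about 13.7 percentage points, in the neighborhood of the 13% reported for universal choice in Table 4. A different baseline would give a different figure, because the normal curve is steepest at an index of 0.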

The results in Table 4 show patterns consistent with those in our New 
York analysis. Choosers are significantly more likely to engage in all measures
of social capital (PTA membership, volunteering for a school activity, talking
to people about schools, and trusting teachers), controlling for other important
factors.¹⁴


School Choice Can Help Build Social Capital 


At the heart of calls for the introduction of market-like reforms into the pub- 
lic sector lies the belief that giving people choices over public goods will 
increase efficiency. Research into the effects of reforming the “supply side” 
of the provision of public goods has established that such competitive mech- 
anisms can in fact pressure the producers of public goods to be more effi- 
cient and more responsive (for local public goods, see, e.g., Ostrom 1972, 
Schneider 1989, Schneider and Teske 1995, Tiebout 1956). Recently, schol- 
ars have begun to study the effects of reforms on the demand-side of the 
market, leading to debates about the level of information held by citizens 
Table 4. The Effects of Choice on the Formation of Social Capital in Two
New Jersey Districts


                     PTA                 Voluntary           Parents             Trust
                     Member              Activity            Talked To           Teacher
                     (standard    %      (standard    %      (standard    %      (standard    %
                     error)     Change   error)     Change   error)     Change   error)     Change

Universal choice     0.35**      13%     0.21*        6%     1.24"       13%     0.28*        6%
                     (.11)               (.13)               (.38)               (.14)
Black                -0.55**    -21%     O48        -14%     -3.38*     -30%     -0.41**     -9%
                     (.13)               (.14)               (.44)               (.15)
Hispanic             -1.24**     45%     -0.96™*    -34%     -2.86*     -12%     0.34         6%
                     (.29)               (.26)               (.91)               (.38)
Asian                0.57       -22%     0.15         4%     -3.49**    -11%     0.49         8%
                     (.33)               (.39)               (1.17)              (.55)
Length of            -0.01    -0.07%     0.02**     0.6%     0.07%**      9%     0.01      0.02%
residence            (.01)               (.01)               (.03)               (.01)
Education            0.09**       3%     0.06**       2%     0.31°*      16%     0.03       0.5%
                     (.02)               (.02)               (.08)               (.03)
Employed             -0.07       -3%     -0.06       -1%     -0.78*       6%     0.27        -5%
                     (.14)               (.16)               (.47)               (.18)
Female               0.40**      15%     0.52**      16%     1.22**      10%     -0.02      0.5%
                     (.13)               (.19)               (.44)               (.16)
Attend church        0.09**       4%     0.06*        2%     0.24"       11%     -0.01     0.01%
                     (.03)               (.03)               (.08)               (.03)
Dissatisfaction      -1.76**     -8%     -0.01     -0.1%     0.51         6%     -0.73*     -18%
                     (.42)               (.14)               (.41)               (.14)
Constant             0.92                -0.45               1.71                1.04
                     (.41)               (.44)               (1.4)               (.49)

                     N = 629             N = 629             N = 626             N = 622
                     χ² = 91             χ² = 61             F = 14              χ² = 43
                     (.00)               (.00)               (.00)               (.00)


Note: In the three probit equations the percentage point change figures indicate the effect of a change
from 0 to 1 for the dummy variables and represent the effect of a unit change for the nondummy variables.
For the regression equation (parents talked to) the percentage changes are calculated from the normalized
beta coefficients.

* p < .05; ** p < .01; *** p < .001


and the levels necessary for markets for public goods to work (e.g., Lowery, 
Lyons, and DeHoog 1995, Lyons, Lowery, and DeHoog 1992, Teske et al.
1993, 1995). This debate has focused on only a limited aspect of the be- 
havior of the “citizen/consumer” in the market for public goods, revolving 
around the question of whether competition can enhance the behavior of 
citizens as consumers. We broaden the question by asking if government 
policies that enhance choice over public goods can increase the capacity 
of the citizen/consumer to act as a responsible, involved citizen. Our results
show that in the domain we study, local public education, the answer is yes.


420 REPRINTED FROM THE AMERICAN POLITICAL SCIENCE REVIEW 


According to Putnam, societies can evolve two different equilibria as 
they solve collective action problems. One equilibrium is built on a 
“virtuous circle” that nurtures healthy norms of reciprocity, cooperation, and 
mutual trust. The other relies on coercion and creates an environment in 
which only kin can be trusted. Civic engagement is at the core of Putnam’s 
concept of social capital because it breeds cooperation and facilitates coordi- 
nation in governing. Public schools constitute a domain in which the virtuous 
circle is essential for improving the quality of education. Hillary Rodham 
Clinton (1996) has argued that “it takes a village” to raise a child. It may 
also take a “village” to educate a child: High quality education is dependent 
on parental involvement supported by high levels of community involvement. 
In turn, higher quality education is associated with activities that build social 
capital—a virtuous circle is created. 

Our research shows that the design of the institutions delivering local 
public goods can influence levels of social capital. No present statistical 
method can fully correct for problems in estimation introduced by the com- 
plex causal linkages that motivate our study. Our two-stage modeling, how- 
ever, clearly addresses the biases introduced by the nonrandom “assignment” 
of parents as active choosers in New York. Our research shows that in both 
an urban and a suburban setting and under different institutional settings of 
choice, the act of school choice seems to stimulate parents to become more 
involved in a wide range of school-related activities that build social cap- 
ital. Our results support arguments linking participation and urban democ- 
racy and, within the domain of schools that we studied, are directly congruent 
with Berry, Portney, and Thomson’s (1993, 254) claim that “increased par- 
ticipation does lead to greater sense of community, increased governmental 
legitimacy, and enhanced status of governmental institutions.” 

Clearly, many factors affecting the formation of social capital are 
individual-level characteristics effectively beyond the control of government 
(e.g., social capital increases with church attendance and with length of res- 
idence in a community). This fundamentally limits the role that govern- 
ment can play in nurturing the formation of social capital. Despite this, 
we believe that governmental policies can and do affect the level of so- 
cial capital. The careful design of governmental institutions may be able 
to reverse the ratchet that Fukuyama believes has only driven social capital 
down. 


Appendix A: Survey Methodology 


We contracted the Polimetrics Research and Survey Laboratory at Ohio State 
University to carry out the survey. To start, Polimetrics identified the zip 
codes in each of the four school districts. All listed telephone numbers for
each zip code were identified. From this, a list was developed using random 
generation of the last two digits of the appropriate telephone exchanges, so 
that unlisted numbers were included as well. All known business telephone 
numbers were removed as they were not eligible to be interviewed. Then, a 
random sample was taken of the remaining numbers. 
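The step of randomizing the last two digits, which brings unlisted numbers into the frame, can be sketched as follows. This is an illustration under stated assumptions, not the procedure Polimetrics actually ran: the listed numbers are invented, the function name and the `per_bank` parameter are hypothetical, and the later steps (removing business numbers, drawing the final random sample) are not shown.

```python
import random


def random_digit_numbers(listed_numbers, per_bank, seed=None):
    """Generate candidate numbers by randomizing the last two digits.

    Each listed number contributes its leading digits (its "bank");
    the final two digits are drawn at random, so unlisted numbers in
    the same bank can be reached.
    """
    rng = random.Random(seed)
    banks = sorted({number[:-2] for number in listed_numbers})
    candidates = []
    for bank in banks:
        for _ in range(per_bank):
            candidates.append(bank + f"{rng.randrange(100):02d}")
    return candidates


# Hypothetical listed numbers for one zip code (two distinct banks).
listed = ["2125550142", "2125550187", "2125559903"]
numbers = random_digit_numbers(listed, per_bank=3, seed=1)
print(numbers)
```

Because the last two digits are drawn independently of the directory, a generated number is no more likely to be listed than unlisted within its bank, which is the point of the design.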

To be eligible to be interviewed, respondents needed to live within the 
school district, have children between grades K-8, be the adult responsible 
for decisions affecting that child’s education, and identify the school their 
child attended (which could be either a private school or a district public 
school). 

The actual interviews were conducted from March through June 1995. 
The interviewers were given extensive training and some interviews were 
conducted in Spanish. Interviews were monitored randomly and, to ensure 
validity, 15% of all completed interviews were verified with respondents by 
the supervisors. 

The goal was to obtain 400 completed interviews in each of the four 
districts. The following table shows the call dispositions in each district. 


Table A-1. Disposition of Survey Telephone Calls


                         District 4   District 1   Montclair   Morristown

Completed                       400          401         408          395
Refusals                        113          522         109          174
No final disposition            225        1,642         281          343
Nonhousehold                  5,237       17,883       5,268       12,913
Ineligible                    5,722       13,469       3,935        5,918

Appendix B: Correcting for Nonrandom Assignment 


As Achen (1986) demonstrates, ordinary regression fails to produce unbiased 
estimates of treatment effects in quasi-experiments when the “assignment” 
to different conditions is not random (see LaLonde and Maynard 1987; Lord 
1967, 1969; Heckman 1978; Heckman, Hotz, and Dabos 1987). Consequen- 
tly, in addition to specifying the behavioral outcome, we must explicitly 
model the assignment process. To deal with the dichotomous nature of three 
of our dependent variables, we apply Achen’s generalized two-stage least 
squares estimator (G2SLS). The steps for this estimation procedure, as well 
as the standard 2SLS we employ to estimate our continuous outcome equa- 
tion, are summarized below. 
Table B-1. Assignment (First-Stage) Equation: Active Public
School Choosers in New York


                         Coefficient    Standard Error

Diversity                  .090*            .038
Values                     .115**           .037
Length of residence        .005*            .002
Years of schooling         .006             .004
Black                      -.252***         .048
Hispanic                   -.216***         .045
Asian                      -.318**          .116
Employed                   .079**           .028
Female                     -.003            .043
Attend church              -.003            .006
Constant                   .127             .114


* p < .05; ** p < .01; *** p < .001
N = 584; F(10, 573) = 10.36; p = .000


The first stage consists of estimating the assignment equation. This can 
be done in a straightforward manner by applying the linear probability model. 
Goldberger’s (1964) two-step weighted estimator can be employed to correct 
for the problems of ordinary least squares (OLS) regression with a dichoto- 
mous dependent variable. Before calculating the weights, the predicted val- 
ues outside the 0-1 interval from the OLS regression should be reset to the 
bounds. It should also be noted that in order for the system of equations to be 
estimated, at least one variable in the assignment equation must be excluded 
from the outcome equation. This variable provides the necessary statistical 
leverage to estimate the system, so its coefficient in the assignment equation 
must be nonzero. See Table B-1 for the results of the assignment equation. 
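The two-step weighted estimator described above can be sketched for the simplest case of one explanatory variable. Several things here are assumptions made for illustration: the data are invented, the function names are hypothetical, and fitted values are reset to 0.01 and 0.99 rather than to exactly 0 and 1, because a fitted value of exactly 0 or 1 would make the weight 1/(p(1 - p)) undefined.

```python
def weighted_line(x, y, w):
    """Weighted least squares fit of y = a + b*x; returns (a, b)."""
    total = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total
    s_xy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    s_xx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


def goldberger_lpm(x, y, eps=0.01):
    """Two-step weighted linear probability model for a 0/1 outcome y."""
    # Step 1: ordinary least squares (all weights equal).
    a0, b0 = weighted_line(x, y, [1.0] * len(x))
    # Reset fitted probabilities that fall outside the interval.
    # Assumption: use (eps, 1 - eps) so that the weights stay finite.
    p = [min(max(a0 + b0 * xi, eps), 1.0 - eps) for xi in x]
    # Step 2: weight each case by the inverse of its estimated variance p(1 - p).
    w = [1.0 / (pi * (1.0 - pi)) for pi in p]
    return (a0, b0), weighted_line(x, y, w), p


# Hypothetical data: years of schooling and a 0/1 active-chooser indicator.
schooling = [8, 10, 12, 12, 14, 16, 16, 18]
chooser = [0, 0, 0, 1, 0, 1, 1, 1]
ols_fit, weighted_fit, fitted = goldberger_lpm(schooling, chooser)
print("OLS intercept, slope:     ", ols_fit)
print("Weighted intercept, slope:", weighted_fit)
```

Cases whose fitted probability is reset to a bound receive very large weights, so the choice of bound can matter a good deal in small samples.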

For the second stage, the forecast values of the treatment variable (the 
dependent variable from the assignment equation) are inserted into the out- 
come equation. When the dependent variable in this equation is continuous 
(as in the case of our “spontaneous sociability” model) ordinary regression 
can be applied. The resulting coefficients are 2SLS estimates. The only re- 
maining step in the continuous variable case consists of correcting the stan- 
dard errors of the coefficients. To accomplish this we first denote the variance 
of the residuals from our OLS regression as ω². Next we generate a new forecast
value for the dependent variable by using the second-stage coefficients and
the original variables. We then compute the variance of the new set of residuals,
σ², by taking the difference between the two equations. The standard
errors of the 2SLS coefficients are corrected by multiplying each standard
error by the square root of σ²/ω².
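The standard-error correction for the continuous case can be sketched as follows, assuming a single excluded variable, a single treatment variable, and no other covariates (the article's equations include many more regressors). The data and all function and variable names are hypothetical.

```python
import math


def ols_line(x, y):
    """OLS fit of y = a + b*x; returns (a, b, sum of squares of x about its mean)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    return y_bar - b * x_bar, b, s_xx


def residual_variance(x, y, a, b):
    """Variance of the residuals y - (a + b*x), with n - 2 degrees of freedom."""
    return sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y)) / (len(x) - 2)


def two_stage(z, d, y):
    """2SLS with one excluded variable z, treatment d, and outcome y."""
    # First stage: forecast the treatment from the excluded variable.
    a1, b1, _ = ols_line(z, d)
    d_hat = [a1 + b1 * zi for zi in z]
    # Second stage: regress the outcome on the forecast treatment.
    a2, b2, s_xx = ols_line(d_hat, y)
    omega2 = residual_variance(d_hat, y, a2, b2)  # what OLS software reports
    naive_se = math.sqrt(omega2 / s_xx)
    # New residuals: second-stage coefficients applied to the original treatment.
    sigma2 = residual_variance(d, y, a2, b2)
    corrected_se = naive_se * math.sqrt(sigma2 / omega2)
    return b2, naive_se, corrected_se


# Hypothetical data: z = excluded variable, d = active chooser, y = parents talked to.
z = [0, 0, 0, 0, 1, 1, 1, 1]
d = [0, 0, 0, 1, 0, 1, 1, 1]
y = [2, 3, 1, 6, 4, 7, 5, 8]
b2, naive_se, corrected_se = two_stage(z, d, y)
print("2SLS slope:", b2)
print("naive SE:", naive_se, "corrected SE:", corrected_se)
```

The slope from the second stage is the same number whichever residuals are used; only the standard error changes, because the residuals from the forecast treatment do not estimate the disturbance variance of the outcome equation.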

If the dependent variable in the outcome equation is dichotomous, as in 
our three other models, additional steps are necessary. Once again we insert 
the forecast values of the treatment variable into the outcome equation. After 
applying OLS to the outcome equation we compute a new forecast value 
for the dependent variable using the regression coefficients and the original 
variables. Once again, predicted values outside the 0-1 interval are reset to 
the bounds. Next we apply Goldberger’s two-step weighted estimator to the 
outcome equation. The coefficients of the final estimation are the G2SLS
estimates, but again, the reported standard errors are wrong. To correct them
we first denote the variance of the residuals from the final stage regression
as ω². We then multiply each standard error by the square root of 1/ω². We
report these corrected coefficients and standard errors in our tables. Note
too that once these corrections are implemented the R² statistic is no longer
meaningful and is not reported for any of our New York models. 
Notes


1. The current debate in political science is focused on somewhat dif- 
ferent issues than we address here. However, our research is directly relevant 
to one central theme of that debate—the role of government in creating social 
capital. In critiquing what he sees as a critical omission by Putnam (1993), 
Tarrow (1996, 395) asks: “Can we be satisfied interpreting civic capacity as 
a home-grown product in which the state has no role?” Similarly, Jackman 
and Miller (1996a, 655) argue that a political institutional approach that en- 
dogenizes civic culture can help explain differential political and economic 
development. 

2. Classic theoretical treatments include: Chubb and Moe 1990; Coons 
and Sugarman 1978; Fantini 1973; Friedman 1955, 1962; Jencks 1966. For 
reviews of school choice in practice, see Cookson 1994; Clune and Witte 
1990; and Wells 1993. 

3. These policies include publicly provided vouchers that can be used in 
a variety of schools, both public and private (see, e.g., Lee 1991), the intro- 
duction of magnet schools (see, e.g., Blank 1990), the introduction of charter 
schools (see, e.g., Wohlstetter, Wenning, and Briggs 1995), and public school 
choice plans such as those we analyze here. 

4. In 1993, the New York City Board of Education established a new 
policy of interdistrict choice. If space is available (usually it is not), students 
can go to schools outside of their district. The Board did not mandate choice 
programs within districts. 

5. Recall that high schools in New York are run by the central Board of 
Education. 

6. The table in Appendix A shows that telephone interviewers had 
greater difficulty completing interviews in District 1 than District 4; however, 
as evident in Table 1, our samples of public school parents are nonetheless
representative of the population of the districts as a whole. 

7. We recognize a limitation inherent in the cross-sectional nature of our 
research design. Ideally, research on changes in social capital would employ 
a longitudinal, interrupted time-series analysis, involving panel responses. 
In this ideal research design, data would be collected prior to institutional 
changes and, by interviewing the same subjects over time, researchers could 
isolate the specific effect of institutional changes. Unfortunately, few re- 
searchers had the foresight or the resources to conduct such a study; trade- 
offs must inevitably be made. For example, Putnam (1993) used aggregate 
level and (some would say) problematic measures of social capital (see, e.g., 
Jackman and Miller 1996a) and went beyond his data to explore historical 
differences in the development of Italian regions. The trade-off in our case is 
that while we cannot gather detailed individual-level data on parents in these
districts before they chose a school, we do have detailed individual measures
today that our cross-sectional design allows us to test while controlling
for individual-level demographic and socioeconomic factors. With replica- 
tion across four different institutional settings, our quasi-experimental design 
provides a strong cross-sectional test of the causal relationships postulated in 
the existing social capital literature. 

8. Participation in the PTA and in voluntary activities is a dichotomous 
variable, with 1 indicating membership in the PTA (52% report membership) 
or voluntary activity (66% report such activity). As Verba, Schlozman, and 
Brady (1995, 74-9) note and our data confirm, levels of voluntary activity in 
social organizations are considerably higher in America than is participation 
in electoral activities. The number of parents a respondent reported talking 
with is a continuous variable based on the midpoints of categories presented 
(mean = 4.5; s.d. = 4.6). Trust in teachers is operationalized as a dichoto- 
mous variable (1 = trusts teachers most of the time or always [77% report 
this level of trust]; 0 = never or only sometimes). 

9. While it is also plausible that there could be a two-way or reciprocal 
relationship between social capital and school choice, the timing of our re- 
search design makes this unlikely: Parents made their school choice in spring 
1994. They were not interviewed until spring 1995, during which time they 
answered questions about activities during the previous school year. Thus, 
they chose first and engaged in the activities we measured later. 

10. Smith and Meier find that religion and race help explain why some 
parents choose private schools for their children (1995, 71-2). Our values 
and diversity variables for the public schools are closely related to these 
concepts. Alternative schools in New York tend to emphasize themes and 
pedagogical approaches that are based on particular social, educational, or 
civic values. Diversity has a somewhat different meaning in districts where 
two-thirds of the children are Hispanic. 

11. To estimate two-stage models, there must be at least one exclusion in 
the assignment equation. In other words, we must find at least one variable 
that significantly influences assignment but not the outcome (Achen 1986, 
38). We use these two variables, diversity and values, as exclusions. 
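
The role of the exclusion can be illustrated with a minimal sketch. It is not the model estimated in the article: it assumes a purely linear setup with simulated data, and every name and number in it (x, z, u, d, y, the coefficients) is hypothetical. The variable z plays the part of an excluded variable: it enters the first-stage (assignment) equation but not the outcome equation.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients for y ~ X (X already includes a constant)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def two_stage(y, d, x, z):
    """Two-stage least squares for y = a + b*d + c*x + error.

    d is the possibly endogenous regressor, x is an included control,
    and z is the excluded variable (first stage only).
    """
    n = len(y)
    const = np.ones(n)
    Z = np.column_stack([const, x, z])       # first-stage regressors
    d_hat = Z @ ols(Z, d)                    # fitted values of d
    X2 = np.column_stack([const, d_hat, x])  # second stage uses d_hat, not d
    return ols(X2, y)

# Simulated data; all values are invented for illustration.
rng = np.random.default_rng(0)
n = 20000
x = rng.normal(size=n)
z = rng.normal(size=n)
u = rng.normal(size=n)  # unobserved factor that drives both d and y
d = 0.8 * z + 0.5 * x + u + rng.normal(size=n)
y = 1.0 + 2.0 * d + 0.5 * x + 2.0 * u + rng.normal(size=n)

# One-stage (OLS) and two-stage estimates of the coefficient on d (true value 2.0).
b_ols = ols(np.column_stack([np.ones(n), d, x]), y)[1]
b_2s = two_stage(y, d, x, z)[1]
print(f"one-stage estimate: {b_ols:.2f}, two-stage estimate: {b_2s:.2f}")
```

In this simulation the one-stage estimate is pulled away from 2.0 by the unobserved factor u, while the two-stage estimate is close to it. That holds only because z was constructed to affect d and to be unrelated to the error in the outcome equation; with real data, that second condition is an assumption, not something the procedure checks.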

12. Our specific measure, indicating whether or not the parent has often 
thought about moving her child to another school, is a dummy variable coded 
1 = yes, the parent has thought about moving her child to a different school. 
We expect a negative relationship between this measure and our measures of 
involvement in the schools. 

13. While the two-stage results are the technically correct ones, we 
should also note that these findings are robust with a simpler methodology. 
Using a one-stage model, the results are essentially the same. 

14. We should also note that, for both urban and suburban districts, 
parents who chose to send their children to private schools are significantly 
more likely to engage in all of these social capital building activities than 
public school parents and more so than even active public choosers, with the 
exception of PTA involvement. This result is not surprising, and has been 
documented in the literature on private schools. 


Index 


≐, nearly equal, 22 
‖ ‖, norm, 30 

⊥, orthogonal, 30 

⫫, independent, 42 
~, distributed as, 68 


adj, classical adjoint, 32 

AIC, Akaike’s information criterion, 
210 

Aitken estimator; see feasible GLS estimator 

Alba and Logan, 106-107 

ambiguity in notation, 102 

American Cancer Society, 60 

Angrist, J., 208, 213 

Ansolabehere and Konisky, 204 

Arbuthnot, A., 94ff 

“as if by experiment”, “as if randomized”, 3, 
6-9, 92, 96, 98, 180, 190-191, 217; 
see also natural experiments, 
randomization, causal inference, 
causation 

as the crow flies, 52 

assignment equation, selection equation, 134, 
141, 143, 194 

association, 2ff; see also correlation 


vs causation, 2ff, 12-13, 17, 53, 134, 
207 
assumptions 
for GLS and FGLS, 61-63 
for IVLS, 181-182 
for logit model, 128 
for MLE, 118-119, 121, 149 
for OLS, 42, 49, 61-62 
for probit model, 121-124 
asymptotic covariance matrix 
for FGLS, 64, 161-166, 167-172, 
175 
for IVLS, 183, 198 
for MLE, 118-119, 123, 149, 150 
asymptotic mean 
for IVLS, 198 
for MLE, 118-119, 149 
asymptotic normality, 39, 59, 66, 70, 73, 
118-119, 149, 198 
asymptotic SEs 
compared with bootstrap SEs, 160-166, 
167-172; see also asymptotic 
covariance matrix, plug-in SEs, SE 
as square root of diagonal element 
in covariance matrix 
asymptotic variance; see asymptotic 
covariance matrix, variance of 
random variable as diagonal 
element of covariance matrix 
asymptotics, 39, 59, 70, 73, 79, 118-119, 
123, 149, 164-166, 175, 198, 211 
compared with bootstrap, 160-166, 
167-173 
autoregression, 160-166, 174 
bootstrap principle for, 161 
average, 19 
average treatment effect, 132 


β, parameter vector, 41ff 
β̂, β̂_OLS, OLS estimator, 42ff 
β̂_FGLS, feasible GLS estimator, 64, 161-166, 
167-172, 174-175 
β̂_GLS, GLS estimator, 64, 65 
β̂_IISLS, two-stage least squares estimator, 
176ff, 186 
β̂_IVLS, instrumental-variables least squares 
estimator, 176ff, 183ff 
Bayesian methods, 210-211 
Beck, N., 80, 148, 175, 280 
Berkson, Joseph, 2 
bias, 5, 43, 53, 59, 64, 66, 68, 92, 112, 119, 
124, 125-126, 130, 134-136, 
139-140, 149, 160-166, 167-173, 
174-175, 176ff, 184, 189-192, 
195-196, 197, 270, 271 
in autoregression, 160-166, 167-172, 
174 
due to endogeneity; see endogeneity 
due to failures in assumptions, 53, 59, 
68, 124, 139-140, 189-192, 
195-196 
in FGLS, 64, 66, 161-166, 167-172 
in IVLS, 184, 197-198 
in MLE, 119, 124, 149, 270, 271, 305 
omitted-variables, 59 
selection, 92, 130ff, 193-196 
due to simultaneity; see endogeneity 
small-sample, 184, 197-198 
bias-variance tradeoff, 150-151 
binary variable, 103 
binomial distribution, 116, 118, 119, 
125-126, 269 
bivariate normal density, ii, 38, 137 
bivariate probit model, 134-138 
Blau, P. M., 81, 101 
Blau and Duncan, 81ff, 101, 302-303 
blinding, 14 
BLUE, best linear unbiased estimator, 61-63, 
64, 78 
BMI, body mass index, 112, 152 
bootstrap, 155ff, 211 
compared with asymptotics, 166, 
167-173 
compared with plug-in methods, 
167-175 
to estimate bias, 160-166, 167-172, 
174-175 
estimator, 156, 158 
parametric, 166-167 
replicates, 156-157 
sample, 156 
SE, 157, 161, 167-175 
bootstrap principle 
for autoregression, 161 
for FGLS, 165-166 
for regression, 159 
for sample mean, 157 
bootstrapping RDFOR, 167-175 
bootstrapping the bootstrap, 173 
box model, 26, 99, 156, 158, 169 
Braithwaite, B., 94ff 
breast cancer, 3, 4-5, 15 
and mammography, 4-5, 15, 200-201 
and telephones, 3 
butter market, 176ff 


χ²_d, chi-squared with d degrees of freedom, 
68 
cancer and smoking, 2-3 
cancer and vitamins, 17 
cancer survivors, 200 
categorical variable, 104-105 
vs quantitative variable, 104-105 
Catholic schools, effectiveness of, 130ff 
causal inference, 1-17 
and constancy assumptions, 13, 91-102, 
209ff 
from non-experimental data, 1ff, 6ff, 9ff, 
81ff, 88ff, 91ff, 130ff, 176ff, 187ff, 
193ff, 209ff 
from observational data; see causal 
inference from non-experimental 
data 
qualitative, quantitative, 11, 23, 89, 97, 
101, 102 
and regression, 9ff, 81ff, 88ff, 91ff, 
176ff, 187ff, 193ff, 209ff 
causation, 114, 209ff 
vs association, 2ff, 12-13, 17, 53, 134, 
207 
as shown by controlled experiments, 
1-5, 17, 22-23, 109-110, 144-145 
as shown by logit models, 153-154 
as shown by natural experiments, 6-9 
as shown by path models, 81-86, 88-93, 
94-102, 113-114, 209ff 
as shown by probit models, 130-140 
as shown by regression models, 1, 9-13, 
81-86, 88-90, 91-102, 105-108, 
176ff, 187ff, 193ff, 209ff 
causation, manipulationalist and 
non-manipulationalist views, 114, 
214, 215 
centering residuals before bootstrapping, 174 
central limit theorem, 38-39, 59, 70, 154, 
198, 211, 243-244, 251, 257 
as inapplicable to GLS or FGLS, 66 
chi-squared distribution, 68 
cholera, 6-9, 16 
cigarettes; see smoking 
citations, model for determinants of, 
107-108 
cloglog specification, 147 
cofactor in a matrix, 32 
Coleman, James S., 139-140, 151 
collinearity, 55, 59, 301-302 
exact, 55 
column vector, 29 
computerized tomography, CT scans, 200 
concave function, 129, 177, 270, 273-274, 
282; see also convex function 
concomitants, as non-manipulable variables, 
192 
conditional probability, expectation, 28 
confounding, 2-4, 5, 11-12, 17, 42, 52, 92, 
94, 130ff, 187ff, 193ff 
in experiments, according to Victora 
et al, 112 
confounding variables 
controlling for, 3-4, 11, 130ff, 187ff, 
193ff; see also regression 
consistency, consistent estimator, 59, 198, 
199-200 
constancy assumptions; see intervention 
constancy under intervention; see 
intervention 
Consumer Price Index, 60 
continuity correction, 244 
continuous variable, 104-105 
vs discrete variable, 104-105 
control, 2-5 
control group, 2-5 
vs treatment group, 2-5 
control variable; see covariate 
convex function, 28, 177, 243-274, 270, 
282 
defined, 28; see also concave function 
Cornfield, Jerome, 15, 153-154 
coronary heart disease, risk factors for, 
153-154 
correlation, 3 
coefficient, 19-21, 23 
coefficient for random variables, 35 
spurious, 3, 53, 56, 60 
cov, covariance, 27; see also covariance 
matrix 
covariance matrix 
for FGLS estimates, 64, 161ff, 167ff 
for GLS estimates, 64 
as having variances on the diagonal, 46 
for IVLS estimates, 183, 198 
for MLE, 118-119, 149-150 
for OLS estimates, 45ff 
for random vectors, 35 
covariate, 42, 192 
critical value of a test, 70, 73, 299-300, 309 
defined, 300 
cross-validation, 75 
cross-tabulation, 1, 3 
vs modeling, 138 
crossover, 15 
Current Population Survey, 82, 83, 105, 113, 
149, 305 


δ_i, disturbance, random error, 82, 89, 98, 179, 
181, 189, 192, 194 
data snooping, 74-75, 79 
data variable, 18ff 
mean of, 19 
vs random variable, 24-25 
standard deviation of, 19, 25 
variance of, 19, 25 
death penalty, determinants of, 146-147 
degrees of freedom, 46-48, 68, 73, 85, 150 
demand curve, 176ff 
convexity of, 177 
as response schedule, 177ff 
demand, determinants of, 178ff 
density 
of jointly normal random variables, 38 
of a normal random variable, 38 
of a random variable, 36 
dependent variable; see response variable 
design matrix, 41ff 
det, determinant of a matrix, 31-32 
determinants of demand, 178ff 
determinants of supply, 178ff 
deviance, 150 
diagnostics, 210, 246, 297 
diagonal matrix, 36 
diet and cancer, 17 
DiNardo and Pischke, 208 
direct effects, 95ff 
discrete variable, 104-105 
vs continuous variable, 104-105 
discrimination against women, a statistical 
model for, 103-104 
disturbance term, disturbances, 22, 41ff 
assumptions on, 22, 41-42, 61-62, 
63-64, 91ff, 98ff, 182ff 
in autoregression, 160 
vs residuals, 23-24, 44-45, 49, 53, 57 
Doll, Richard, 2 
dot product, 30 
dummy variable, 103-104, 113, 121, 
130-131 
and interactions, 138-139 
Duncan, Otis Dudley, 81, 114, 188 


E, expectation, 24 
e_i; see residual 
ε_i, disturbance, random error, 22, 41, 61, 83, 
87, 93, 98, 179, 189, 192 
EC/IC Bypass Study, 14 
economic growth, in relation to left-wing and 
trade-union power, 147-148, 271 
Edgeworth, F. Y., 148 
education 
and fertility, 187ff, 306-308 
and PTA membership, 193ff 
educational level 
father and son, 87ff 
husband and wife, 105 
eigenvalue, eigenvector, 36 
email volume, determinants of, 205-206 
empirical covariance matrix, 65, 159, 164, 
171 
empirical distribution of a sample, 157-158 
endogeneity and exogeneity 
described, 59, 92, 96, 98, 134, 136, 
179-182, 198 
exogeneity assumptions behind causal 
inference, 92, 96-102, 136, 180, 
182, 188ff, 195ff 
mathematical examples, endogeneity 
bias, 54, 55-56, 59, 92, 112, 
198-199, 206, 207 
practical examples, endogeneity bias, 
139-140, 176ff, 189-192, 
195-196, 209ff 
statistical models for endogeneity, 
134-137, 176ff 
tests for exogeneity, 288 
endpoint maximum, 117 
epidemiology, 2ff, 15 
error function, 40 
error term; see disturbance term; see also 
residual 
error term compared with latent variable, 
123-124 
estimability, 125-127, 150 
vs identifiability, 125-127, 150 
estimate 
vs parameter, 23-24, 49, 57, 111 
estimators 
bias in, 53, 64, 68, 119-120, 160-166, 
167ff, 184, 197-200; see also bias 
due to failures in assumptions 
consistent, 59, 198, 200, 211 
FGLS, 64ff, 161ff, 167ff 
GLS, 63ff 
IISLS, 2SLS, 186 
inconsistent, 72, 199-200 
IVLS, 181ff, 306-308 
MLE, 115-124, 128ff, 303-306 
OLS, 9-13, 22-23, 34, 41ff, 295ff 
unbiased, 43, 46, 61-63, 63-64, 92, 
102-103, 109-110, 125-126, 269, 
270, 271 
Evans and Schwab, 130ff, 141-142, 151, 217 
data issues, 151 
modeling issues, 138-140 
exchangeable variables, 109 
exclusion restrictions; see identifying 
restrictions 
exogeneity, exogenous variable; see 
endogeneity and exogeneity 
expectation, expected value, 24 
compared with mean, 24 
conditional, 28 
experiments, 1-5, 14-17 
flawed, 14 
gedanken, hypothetical; see thought 
experiments 
unnecessary with large effects, 17 
experiments vs observational studies, 2-5, 
14, 17 
explained variance, 51-53 
as related to F, 74 
explanatory variable, 41ff 
exponential families, 119, 149, 269 
exposed group; see treatment group 
eyewitness testimony, 201 


φ, standard normal density, 38 
Φ, standard normal distribution function, 
121ff 
F-test, statistic, distribution, 59, 72-74, 
300-301 
fertility and education, 187ff, 306-308 
FGLS, bootstrap principle for, 165-166 
FGLS, feasible GLS, 64-67, 161-166, 
167-173, 174-175 
assumptions, 63ff 
estimator, 64-67 
one-step, 66, 161-166, 167-173 
two-step, 66 
Fisher, R. A., 2, 15, 72-73, 118-119, 123, 
150, 257 
Fisher information, 118-119, 123, 150 
fitted value, 47 
fixed-effects models, 67, 78 
compared with random-effects models, 
263 
flag, 103 
foreign investment, effects on political 
oppression, 105-106 
fraction of variance explained by regression, 
50-53 
Framingham heart study, 153-154 
free arrow, 83-86, 88, 94, 102 
Frisch-Waugh theorem, 241 
full rank, 32, 41, 126-127, 129, 182, 184, 240 
and identifiability, 126-127, 129, 184 
vs rank deficient, 32 


Garrett, G., 147-148, 152, 280 
Gauss, Carl Friedrich, 9, 21, 34, 62 
Gauss’ theorem 
for multiple regression, 34 
for the regression line, 21 
Gauss-Markov theorem, 62, 269 
Gibson, J. L., 88ff, 101, 105, 303 
Giffen, Robert, 148-149 
Gilens, 205 
GLS, generalized least squares, 63ff, 182 
assumptions, 63 
estimator, 63-64 
estimator is conditionally unbiased, 64 
model, GLS model; see GLS 
assumptions 
Goldberger, Joseph, 16 
Goldthorpe, J., 214 
goodness of fit, 53 
vs model validity, 53, 111, 207, 268, 
292-293 
Gram-Schmidt process, 69 
graph of averages, 20-21 


H, hat matrix, 47 
happiness, determinants of, 203 
hat notation for estimators, 22 
Heckman, J. J., 151, 213 
heights of fathers and sons, 18-21, 23 
Hendry, D., 60, 212, 213 
Henschke et al, 200 
heteroscedasticity, 66, 78, 146-147, 279-280 
Hill, Bradford, 2 
HIP, Health Insurance Plan of New York, 
4-5, 13, 15 
homoscedasticity, 60, 66 
Hooke’s law, 22-23, 28, 43-44, 87, 91-93 
HRT, hormone replacement therapy, 17, 144, 
152 
HS&B, High School and Beyond, 130, 136, 
139-140, 151 
Huber-White correction for 
heteroscedasticity, 78 
hypothesis testing, 68-74, 299-300 
critical value, 68, 70, 73, 300 
deviance, 150 
F-test, 72ff, 300-301 
level, 300 
hypothesis testing (cont.) 
power, 300 
score test, 150 
significance levels, 70, 299-300 
size, 300 
statistical significance, 70, 300 
t-test, 68ff, 298-299, 308-309 
z-test, 68 
hypothetical experiments; see thought 
experiments 


I_θ; see Fisher information 
I_{n×n}, the n × n identity matrix, 30 
idempotent projection matrices, 47-48, 186 
identifiability, 125-127, 135-136, 150-151, 
182, 213 
vs estimability, 125-127, 150 
identifying restrictions, 135-136, 140, 
189-192, 196, 209ff 
IEEE arithmetic, 295 
IID, independent and identically distributed, 
22, 24, 39, 42, 50, 68ff, 91ff, 118, 
123ff, 155ff, 181ff, 209ff 
IISLS, 2SLS, two-stage least squares, 176, 
181, 186 
relation to OLS, 181, 186; see also 
IVLS 
inconsistent estimator, 200 
independent and identically distributed; see 
IID 
independence assumptions as basis for 
computing SEs, 45-46, 54, 57, 
59-60, 77 
independent effects, 70 
independent variable, 41ff 
independence vs orthogonality, 42, 244 
indicator variable, 103 
indirect effects, 95ff 
Indonesia, 202-203 
information, information matrix; see Fisher 
information 
information, observed, 119, 123, 131, 
303-306 
inner product, 30 
instrument, instrumental variable, 135, 176, 
181-185, 191, 195, 197-200 
instrumental-variables least squares; see 
IVLS 
intention-to-treat, 5, 15 
vs per protocol, 235 
vs treatment received, 5, 252 
interactions, 138-139, 143-144, 147-148 
of left-wing and trade-union power, 
147-148, 280 
of TV ads with other variables in 
election model, 143-144 
intercept of regression line, 20, 50 
intermediate variables, 95ff 
International Agency for Research on Cancer, 
15 
intervention, 13, 86, 87, 91-102, 114, 187, 
190-191, 196, 209ff 
vs observation, 2, 13, 101, 191 
vs selection, 101; see also manipulation 
invariance assumptions; see intervention 
invariance under intervention; see 
intervention 
invertible matrix, 32 
iteratively reweighted least squares, 66 
IVLS, instrumental-variables least squares, 
181ff, 197-200, 306-308 
assumptions, 181-182 
asymptotic normality, 198 
asymptotic variance, 183, 198 
bias in, 184, 197-198 
consistency, 197-198 
model, IVLS model; see assumptions 
relationship with OLS, 185, 186, 
197-198 
simulations for, 199-200; see also 
IISLS 


Jacobs and Carmichael, 146-147 

Jensen’s inequality, 270 

jointly normal random variables, 
38-39 

just-identified system, 182 


Kahneman, D., 213-214 

Keefe et al, 14 

Keynes, John Maynard, 212 

King, Keohane, and Verba, 204, 
216 

Koch, Robert, 6, 16 

Krueger, A., 185, 208, 213 


Λ, logistic distribution function, 128 
logit model defined by, 128 

L_n, log likelihood function, 116ff 

Labrie et al, 58-59, 252 
lag, lag term, in autoregression, 160-161, 
161-166, 167-172, 174-175 
latent variable, 123-124, 128, 132-137, 
139-140, 195 
vs error term, 124 
law of 
diminishing marginal utility, 177 
error, normal, 148 
large numbers, 14, 198, 236, 251, 252, 
267, 268 
supply and demand, 177ff, 191 
lead time bias, 288 
least squares, 9-13, 22-23 
iteratively reweighted, 66 
weighted, 65, 67; see also regression, 
OLS 
left-wing political power, 147-148, 280 
Legendre, Adrien Marie, 9 
Lehmann, Erich, 73, 149, 217, 258 
level of a test, 300 
levels of measurement, 114 
Lieberson, S., 214-215 
likelihood function, 116ff 
linear probability model, 123, 193-196 
log likelihood function, 116ff 
log odds ratio, 128 
logistic curve, history of, 153-154 
logistic distribution function, 128 
monotonicity of, 128 
symmetry of, 128 
logistic regression, 128 
history of, 153-154 
logit, 128 
logit model, 128-129, 149, 153-154, 
305-306 
lung cancer death rate, 53, 56, 60, 252 
lung cancer screening, 200 
Lu-Yao and Yao, 201, 289 


malaria, 146, 279 

Malthus, Thomas, 153, 189 

Malthusian population theory, 153 

Mamaros and Sacerdote, 205-206 

mammography, 4-5, 15, 200-201 

manipulation, 86, 91ff, 94ff, 114, 209ff 
vs observation, 2, 13, 101-102; see also 
intervention 

manipulationalist and non-manipulationalist 
views of causation, 114, 214-215 

marathons, 201 
marginal effects, 131-132 
matrix, 29 
addition, 29 
classical adjoint, 32 
covariance, 35 
design, 41ff 
determinant of, 31-32 
diagonal, 36 
fixed, 42 
identity, 30 
inverse, 31-32 
invertible, 32 
multiplication, 30 
non-negative definite, 36-37 
orthogonal, 36 
positive definite, 36-37 
positive semi-definite, 36-37 
random, 42 
rank, 32 
symmetric, 30 
trace, 30, 33 
transpose, 30 
unitary, 36 
zero, 30 
maximum likelihood estimator; see MLE 
McCarthyism, 88ff, 101 
mean 
of data variable, 19, 25 
of random variable, 24-25 
of sample, 24 
measurement error, 112-113, 204, 290 
median, 28 
mediating variables; see indirect effects 
Meehl, P., 17, 215 
Megawati, 202-203 
microbiology, 16 
Mill, John Stuart, 17 
misspecification, 147, 149, 175, 209ff, 
279-280 
MLE, maximum likelihood estimator, 115ff, 
128ff, 130ff, 149-150, 303-306 
assumptions, 118-119, 121-124, 149 
assumptions, consequences of failures 
in, 124 
asymptotic normality, 118-119, 123, 
149 
behavior with small samples, 119-120, 
305 
bias in, 119, 120, 269-271, 305; see also 
bias due to failures in assumptions 
MLE, maximum likelihood estimator (cont.) 
in binomial model, 116-117, 118, 119 
compared with OLS in normal model, 
120, 271 
consistency, 118-119, 123, 149 
in logit model, 128, 291-292 
in normal model, 115-116, 119-120 
in Poisson model, 117 
in probit model, 121ff, 129, 130ff 
model selection, 204-205, 209ff 

MSE, mean square error, 21 

multicollinearity; see collinearity 

multiple comparisons; see data snooping 

multiple regression, 9-13, 26, 34, 41ff 

multivariate normal, 38-39 


N(μ, σ²), normal distribution, 38 
natural experiments, 1, 6-9, 213 
NELS, National Educational Longitudinal 
Surveys, 151 
Newton-Cotes method, 152-153 
Newton-Raphson method, 152-153 
Neyman, Jerzy, 150, 217 
Neyman-Pearson statistic, 150 
nominal SEs, nominal variances, 80, 175; see 
also asymptotic SEs, plug-in SEs 
non-experimental data; see observational data 
non-negative definite matrix, 36-37 
non-parametric methods, 258 
non-singular matrix; see invertible matrix 
norm, 30 
normal 
central limit theorem and the, 39 
density of the, 38, 124, 129, 137 
distribution function of the, 121ff, 129, 
130ff 
as exponential family, 119, 269 
joint normality, 38-39 
MLE in, 115-116 
moments of the, 80 
parametric bootstrap for the, 166-167 
probit model defined by the, 121-124 
random variables, 38-39 
regression models with errors that are, 
68ff 
standard, 38 
tail bounds for the, 129 
null hypothesis, 68, 72-73, 300 
mistakes formulating hypotheses, 71, 
111-112 
mistakes interpreting significance levels, 
71 
Nurses’ Health Study, 144, 152 


observation vs manipulation, 2, 13, 101-102; 
see also intervention 
observational data, 1ff 
causal inference from, 1-4, 6-9, 9-13, 
16-17, 81ff, 88ff, 91ff, 94ff, 130ff, 
176ff, 187ff, 193ff, 209ff 
observational studies, 1ff; see also 
observational data 
vs experiments, 1-5, 13, 17 
observed information, 119, 123, 131 
observed significance levels, 70, 300 
defined, 300 
mistaken interpretations, 71-72, 
111-112 
observed value, 25, 42 
vs random variable, 25 
occupational status, 81-86, 187-192 
odds ratio, 128 
OLS, ordinary least squares, 34, 41ff 
assumptions, 41-42, 61-62 
assumptions, consequences of failures 
in, 53, 59, 80 
asymptotic normality, 59, 299 
compared to IVLS, 186, 197-198; see 
also regression, GLS, IVLS, path 
diagrams 
OLS estimator, 34, 42ff 
bias in; see OLS assumptions, 
consequences of failures in 
conditional variance, 45-46 
conditionally unbiased, 43 
consistency, 59 
OLS model, regression model; see OLS 
assumptions 
omitted-variables bias, 59 
one-step GLS, 65-66, 163ff, 167ff 
as problematic according to a social 
scientist, 77-78 
orthogonal, 30 
orthogonal matrix, 36 
orthogonality vs independence, 42, 244 
orthonormal, 39 
outer product, 31 
out-relief, 9-13, 45, 57, 148-149, 203, 
297-301 
over-identified system, 182 
P-values, 70, 300; see also hypothesis 
testing, observed significance levels 
Pacini, F., 16 
pain control, 14 
parameter, 22 
vs estimate, 23-24, 49, 57, 111 
parametric bootstrap, 166-167 
Pasteur, Louis, 6, 16 
path diagram, 81ff 
as a causal model, 81-86, 88-90, 91-93, 
94-102, 107-108, 113 
complete, 113 
represented as a box model, 99-100 
path model; see path diagram 
pauperism, 9-13, 45, 57, 148-149, 203-204, 
297-301 
Pearson, Karl, 19, 23, 154 
Pearson and Lee, 19, 23 
per protocol analysis, 235; see also 
intention-to-treat, treatment 
received 
permeability of social structure, 81, 85-86 
Pettenkofer, Max von, 16 
philosophers’ stones, 211 
Pisano et al, 200-201 
plug-in 
SEs, 64, 160ff, 167ff, 175 
SEs, compared with bootstrap SEs, 
161ff, 167ff 
variance estimators, 46, 64, 119, 164, 
171ff, 183, 198; see also 
asymptotic SEs, nominal SEs 
Podunk University, 266 
Poisson distribution, 117, 118, 119, 120, 269 
policy preferences, as related to political 
knowledge, 205 
political oppression, as related to foreign 
investment, 105-106 
population forecasting, modeling, 153 
positive definite matrix, 36-37 
positive semi-definite matrix, 36 
potential outcomes, 99, 213 
poverty, causes of, 9-13, 45, 57, 148-149, 
297-301 
power of a test, 300 
Powers and Rock, 142-143, 151 
predicted value; see fitted value 
prediction, 1, 13 
vs description, 1, 13 
presidential elections and TV ads, 143-144 
probit model, 121-124, 129, 130ff 
bivariate, 134ff 
product 
inner, 30 
outer, 31 
projection theorem, 241 
prostate cancer, PSA, 58-59, 201-202, 252, 
289 
PTA membership, determinants of, 193-196 
publication productivity, determinants of, 
107-108 
Pythagoras’ theorem, 33, 51-52 


quadrature, 137, 152-153 

qualitative vs quantitative causal inference, 
11, 23, 89, 97, 101, 102 

qualitative vs quantitative variable, 104-105 

Quetelet, Adolphe, 11, 16 


r, correlation coefficient, 19-21 
R, statistical package, 309 
R², fraction of variance explained, 50-53 
for equations without an intercept, 75 
as measuring association rather than 
model validity, 53, 56-57, 111, 
207, 292-293 
as related to F, 74 
as related to r, 53 
random-coefficients models, 209; see also 
random-effects models 
random-effects models, 78, 263 
compared with fixed-effects models, 263 
random error; see disturbance term 
vs residual; see disturbance term vs 
residuals 
random matrix, 42 
random number generators, 298 
random variables, 14, 24-27, 28 
correlation, 35 
covariance, 35 
density, 36 
expectation, expected value, 24 
independent, 24 
jointly normal, 38-39 
mean; see expectation 
normally distributed, 38-39 
observed values of, 25, 26, 42 
realizations of; see observed values 
relationship to data, 24-25, 42, 44 
relationship to samples, 24 
random variables (cont.) 
standard error, 25 
variance, 24, 25, 35-36 
randomization, 1-5, 14, 17, 109-110, 
144-146 
as basis for causal inference, 1-5, 
109-110, 144-146; see also “as if 
randomized” 
randomized controlled experiments; see 
randomization 
rank, 32 
deficient, 32 
full, 32, 41ff, 182ff, 240 
as related to identifiability, 41ff, 
125-127, 182ff, 271-272 
rational choice theory, 213-215 
raw; see unstandardized regression 
coefficients, unstandardized 
variables; see also standardization, 
standardized regression coefficients 
RDFOR, Regional Demand Forecasting 
Model, 167ff 
reading, determinants of, 121-124, 149 
realization; see observed value 
Redelmeier and Greenwald, 201 
regression, 1ff, 9ff, 18ff, 41ff 
to control for confounding variables, 
9-13, 187-192, 193-196 
diagnostics; see diagnostics 
effect, 236 
line, 18-23 
as a model for causation, 1, 9-13, 
22-23, 81-86, 87-90, 91-93, 
94-102, 105-111, 113-114, 
176-181, 187-192, 193-196, 209ff 
multiple vs simple, 26 
uses of, 1, 13 
regression, bootstrap principle for, 159 
regression, uses of, 1, 13 
description, 13 
prediction, 13, 17 
to infer causation; see regression as a 
model for causation 
regression coefficients 
standardized, 86, 87 
unstandardized, 86, 87 
regression diagnostics; see diagnostics 
regression equation, 9-13, 22-23, 73, 81ff, 
91ff, 94ff 
vs structural equation, 101-102 
regression line 
as flatter than SD line, 20-21 
as linear approximation to graph of 
averages, 20-21 
regression model, 9-13, 22-23, 41ff, 61-68 
for causation; see regression as a model 
for causation 
vs fitted model, 23-24, 44-45, 49, 
53-54, 57; see also OLS model, 
OLS assumptions 
religious coping, 14 
rent control, 176-178 
replication, 79 
repression in the McCarthy era, 88-90, 
303 
residential integration, determinants of, 
106-107 
residual, 21, 23-24, 42-43 
vs random error; see disturbance term vs 
residuals 
residual variance; see unexplained variance 
response schedule, 13, 91-102, 113, 
133-136, 176-181, 187, 189-192, 
193-196, 217 
response variable, 41ff 
Rindfuss et al, 187ff, 217-218, 306-308 
RMS, root mean square, 21 
robust to misspecification, 146-147 
Rodgers and Maranto, 107-108, 266 
root mean square error, 21 


s_x, sample standard deviation, 19 
sample 
mean, 19 
mean, as a random variable, 24 
variance, 19 
variance, as a random variable, 24 
sample mean, bootstrap principle for, 157 
SAT, effects of coaching on, 93-94, 142-143, 
151 
scatter diagram, 18ff 
Schneider et al, 193-196, 217-218 
school choice, effects of, 130-140, 193-196 
score test, 150 
screening; see lung cancer screening, 
mammography, prostate cancer 
SD, standard deviation, 18-19, 25 
SE, standard error, 25 
for slope and intercept of regression line, 
50 
as square root of diagonal element in 
covariance matrix, 46; see also 
covariance matrix 
SE vs SD, 25 
selection bias, 92, 130, 134ff, 193-196 
selection equation; see assignment equation 
selection vs intervention, 101-102 
self-selection, models for, 130, 134ff, 
193-196 
Semmelweis, Ignaz, 16 
Sen, A. K., 214 
Shaw, D. R., 143-144, 277-278 
significance, statistical; see significance 
levels 
significance levels, 68ff, 300 
barely significant, statistically 
significant, highly significant, 70 
mistakes interpreting, 71-72, 111-112; 
see also hypothesis testing, 
observed significance levels, 
P-values 
simple vs multiple regression, 26 
Simpson’s rule, 152 
simultaneity bias; see endogeneity 
simultaneous-equation models, 176ff, 209ff, 
306-308 
size of a test, 300 
slope of regression line, 20, 50 
small-sample bias, 184, 197-198 
smoking, health effects of, 2-3, 15 
Snow, John, 6-9, 16 
social capital, 193-196 
social physics, 11, 16, 86, 89, 194 
social status, social stratification, 81ff, 101, 
188 
sparse cross-tabs, handled by modeling, 
138 
specification, specifying a model, 128, 149, 
178-179, 193, 209ff 
specification error, 147, 149, 175, 308-309 
specification tests, 204-205, 210ff; see also 
diagnostics 
spectral theorem for matrices, 36-37 
stability under intervention; see intervention 
standard deviation 
data variable, 19, 25 
random variable, 25 
standard error; see SE 
standard normal density, 38 
standard units, 20 
standardization, 20, 82ff, 86, 87, 89ff, 90-91, 
113, 263 
standardized regression coefficients, 86, 87 
statistical modeling, issues in, 209ff 
statistical packages, 309 
as replacing statistical tables, 309 
Stouffer survey, 88-90 
stratification (cross-tabulation), 1-4, 138; 
see also social stratification 
structural equations, 101-102, 187, 190-192, 
213-215 
vs regression equations, 101-102 
structural zeros, 136 
Student’s t-test, statistic, distribution, 68-70, 
298-299, 308-309 
supply, determinants of, 178ff 
supply curve, 176ff 
as a response schedule, 178ff 
concavity of, 177 
symmetry 
of logistic distribution, 128 
of matrices, 30 
of normal distribution, 120 
of projection matrices, 47, 186 


2SLS, IISLS, two-stage least squares, 176, 
186; see also IVLS 

t-test, statistic, distribution, 68-70, 298-299, 
308-309 

telephones and breast cancer, 3, 15 

thought experiments, 95ff, 114, 190-192, 
209ff 

Timberlake and Williams, 105-106 

Tinbergen, Jan, 212 

tolerance of dissent, 88ff, 303 

trace of a matrix, 30 

trade-union power, 147-148, 280 

traffic fatalities, 201 

transpose of a matrix, 30 

trapezoid rule, 152 

treatment group, 2-5 

vs control group, 2-5 
treatment received, 5 
vs intention-to-treat, per protocol, 5, 
235, 252 

TV ads and presidential elections, 143-144 

Tversky, A., 213-214 

Two-County Study on mammography, 5 

two-equation model for effects of Catholic 
schools, 134ff 
two-stage least squares; see IISLS 
two-step GLS, 66 


unbiased estimators, 43, 46-48, 58, 61-64, 
78, 92, 102-103, 109-110, 
119-120, 125-127, 269-270 

unconditional vs conditional expectation, 59 

under-identification, 125-127, 136, 150-151, 
182, 209ff 

unexplained variance, 51-52 

unitary matrix, 36 

University, Podunk; see Podunk 
University 

unmodeled heterogeneity, 204 

unstandardized regression coefficients, 
variables, 86, 87 


var, variance, 19, 24-25, 27, 35 

of data, 19, 25 

of estimates in a simple regression 
model, 50 

explained, unexplained, 51-52 

of FGLS estimates, 64, 65-66, 161ff, 
167ff, 174-175 

of GLS estimates, 64 

of IVLS estimates, 183, 197-198 

of MLE, 118-119, 123, 149-150 

of OLS estimates, 45-48, 61-62 

of random variables, 24-25, 35 

of random variables as diagonal 
elements of the covariance matrix, 
35 

of samples, 24 

variable 

binary, 103 

categorical, 104-105 

continuous, 104-105 

dependent, 41ff 

discrete, 104-105 


dummy, 103-105, 113 
endogenous; see endogeneity and 
exogeneity 
exogenous; see endogeneity and 
exogeneity 
explanatory, 41ff 
independent, 41ff 
indicator, 103 
instrumental, 176, 181-186, 191-192, 
193-196 
latent, 123-124, 126-127, 128, 
132-133, 134-137 
non-manipulable, 114, 192, 196 
qualitative, 104-105 
quantitative, 104-105 
random, 14, 24-27 
response, 41ff 
vectorizing code, 301 
Verhulst, P. F., 153 
vitamins and cancer, 17 
voter turnout, determinants of, 204 


weighing designs, 108-109, 112 

weighted least squares, 65 

White, H., 78 

White’s correction for heteroscedasticity, 78, 
146-147, 175 

Wilks’ statistic, 150 


x̄, sample mean, 19 

X, generally the design matrix, 41ff; 
sometimes, a random variable or 
vector, 14, 94, 102 


Yule, G. U., 9-13, 16-17, 60, 148-149, 
203-204, 297-301 


0_{m×n}, an m × n matrix of zeros, 30 
z-test, 68 


